LNCS 2151



Albertas Caplinskas
Johann Eder (Eds.) 



Advances in Databases 
and Information Systems 



5th East European Conference, ADBIS 2001 
Vilnius, Lithuania, September 2001 
Proceedings 




Springer 




Lecture Notes in Computer Science 2151 

Edited by G. Goos, J. Hartmanis, and J. van Leeuwen 




Springer 

Berlin 

Heidelberg 

New York 

Barcelona 

Hong Kong 

London 

Milan 

Paris 

Tokyo 




Albertas Caplinskas Johann Eder (Eds.) 



Advances 
in Databases and 
Information Systems 



5th East European Conference, ADBIS 2001 
Vilnius, Lithuania, September 25-28, 2001 
Proceedings 




Springer 




Series Editors 



Gerhard Goos, Karlsruhe University, Germany 
Juris Hartmanis, Cornell University, NY, USA 
Jan van Leeuwen, Utrecht University, The Netherlands 

Volume Editors 
Albertas Caplinskas 

Institute of Mathematics and Informatics 
Akademijos st. 4, 2600 Vilnius, Lithuania 
E-mail: alcapl@ktl.mii.lt 
Johann Eder 

University of Klagenfurt, Department of Informatics Systems 
Universitätsstr. 65, 9020 Klagenfurt, Austria
E-mail: eder@isys.uni-klu.ac.at 



Cataloging-in-Publication Data applied for 
Die Deutsche Bibliothek - CIP-Einheitsaufnahme 

Advances in databases and information systems : 5th East European conference 
; proceedings / ADBIS 2001, Vilnius, Lithuania, September 25 - 28, 2001 . 
Albertas Caplinskas ; Johann Eder (ed.). - Berlin ; Heidelberg ; New York ; 
Barcelona ; Hong Kong ; London ; Milan ; Paris ; Tokyo : Springer, 2001 
(Lecture notes in computer science ; Vol. 2151) 

ISBN 3-540-42555-1 



CR Subject Classification (1998): H.2, H.3, H.4, H.5, J.1
ISSN 0302-9743 

ISBN 3-540-42555-1 Springer-Verlag Berlin Heidelberg New York



This work is subject to copyright. All rights are reserved, whether the whole or part of the material is 
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, 
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication 
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, 
in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are
liable for prosecution under the German Copyright Law. 

Springer-Verlag Berlin Heidelberg New York
a member of BertelsmannSpringer Science+Business Media GmbH 

http://www.springer.de

© Springer-Verlag Berlin Heidelberg 2001 
Printed in Germany 

Typesetting: Camera-ready by author, data conversion by PTP-Berlin, Stefan Sossna 
Printed on acid-free paper SPIN 10845509 06/3142 5 4 3 2 1 0 




Preface 



ADBIS 2001 continued the series of Eastern European Conferences on Advances 
in Databases and Information Systems. It took place in Vilnius, the capital of 
Lithuania. 

The ADBIS series already has established some reputation as a scientific 
event of high quality serving as an internationally highly visible showcase for 
research achievements in the field of databases and information systems in the 
Eastern and Central European region as well as a forum for communication 
and exchange with and among researchers from this very area. The topics of 
the conference were related to the European Commission’s initiative “eEurope - 
Information Society for All”. They represent research areas that will greatly
influence the functionality, usability, and acceptability of future information
products and services. The conference sought to bring together all major players
(experienced researchers, young researchers, representatives from industry), to 
discuss the state of the art and practical solutions, and to initiate promising fu- 
ture research. In order to ensure the active participation of representatives from 
industry the conference provided special industrial sessions. 

The theme of ADBIS 2001 was set with invited talks given by renowned 
scientists covering important aspects of this wide field of research: Web access, 
processes and services in e-commerce, and content management. 

The call for papers attracted 82 submissions from 30 countries. In a rigorous 
reviewing process the international program committee selected 25 papers for 
long presentation and inclusion in these proceedings. All these papers have been 
reviewed by at least three reviewers who evaluated their originality, significance, 
relevance, and presentation and found their quality suitable for international 
publication. 

Topically, the accepted papers span a wide spectrum of the database and 
information systems field: from query optimization and transaction processing 
via design methods to application oriented topics like XML and data on the web. 

Furthermore, the program featured two tutorials, presentations of short pa- 
pers with challenging novel ideas at an early stage, experience reports, and in- 
dustrial papers on the application of databases and information systems. 

We would like to express our thanks and acknowledgement to all the people 
who contributed to ADBIS 2001: 

— the authors, who submitted papers to the conference, 

— the reviewers and the international program committee who made this
conference possible by voluntarily giving of their time and expertise to ensure
the quality of the scientific program

— the sponsors of ADBIS 2001 who supported the organization of the confer- 
ence and permitted the participation in particular of young scientists from 
Eastern countries 

— all the good spirits and helpful hands of the local organization committee 







— Mr. Günter Millahn (Brandenburg University of Technology at Cottbus) for
maintaining the conference server 

— Springer-Verlag for publishing these proceedings and Mr. Alfred Hofmann
for the effective support in producing these proceedings

— and last but not least we thank the steering committee and, in particular, 
its chairman, Leonid Kalinichenko, for their advice and guidance. 

Finally, we hope that all who contributed to this event see this volume as a 
reward and as representation of scientific spirit and technical progress commu- 
nicating exciting ideas and achievements. 



June 2001 Albertas Caplinskas and Johann Eder 




Conference Organization 



The ADBIS 2001 Conference on Advances in Databases and Information Systems 
was organized by the Vilnius Gediminas Technical University, the Institute of 
Mathematics and Informatics, Vilnius, and the Lithuanian Computer Society in 
cooperation with the ACM SIGMOD Moscow Chapter and the Lithuanian Law
University. 



General Chair 

Edmundas Zavadskas (Vilnius Gediminas Technical University, Lithuania) 

Program Committee Co-chairs 

Albertas Caplinskas (Institute of Mathematics and Informatics, Lithuania) 
Johann Eder (University of Klagenfurt, Austria) 

Program Committee 

Suad Alagic (Wichita State University, USA) 

Leopoldo Bertosi (Pontificia Universidad Catolica de Chile, Chile)

Juris Borzovs (Riga Information Technology Institute, Latvia) 

Omran A. Bukhres (Purdue University, USA) 

Wojciech Cellary (Poznan University of Economics, Poland) 

Bohdan Czejdo (Loyola University, USA) 

Hans-Dieter Ehrich (Braunschweig Technical University, Germany) 

Heinz Frank (University of Klagenfurt, Austria) 

Remigijus Gustas (University of Karlstad, Sweden) 

Tomas Hruska (Brno Technical University, Czech Republic)

Yoshiharu Ishikawa (University of Tsukuba, Japan)

Leonid Kalinichenko (Institute for Problems of Informatics, 

Russian Academy of Sciences, Russia) 

Wolfgang Klas (Ulm University, Germany) 

Matthias Klusch (German Research Center for Artificial Intelligence GmbH, 
Germany) 

Mikhail R. Kogalovsky (Market Economy Institute, 

Russian Academy of Sciences, Russia) 

Kalle Lyytinen (University of Jyvaskyla, Finland) 

Yannis Manolopoulos (Aristotle University, Greece)

Michail Matskin (Norwegian University of Science and Technology, Norway)
Tomaz Mohoric (Ljubljana University, Slovenia)

Tadeusz Morzy (Poznan University of Technology, Poland) 

Pavol Navrat (Slovak University of Technology, Slovakia) 







Nikolay N. Nikitchenko (Kiev University, Ukraine) 

Boris Novikov (University of St. Petersburg, Russia)

Maria Orlowska (The University of Queensland, Australia) 

Euthimios Panagos (voicemate, Inc, USA) 

Bronius Paradauskas (Kaunas University of Technology, Lithuania) 
Oscar Pastor Lopez (Universidad Politecnica de Valencia, Spain) 

Jaan Penjam (Tallinn Technical University, Estonia) 

Gunther Pernul (University of Essen, Germany) 

Jaroslav Pokorny (Charles University, Czech Republic) 

Henrikas Pranevichius (Kaunas University of Technology, Lithuania) 
Colette Rolland (University of PARIS-1 Pantheon/Sorbonne, France)
Klaus-Dieter Schewe (Technical University Clausthal, Germany) 
Timothy K. Shih (Tamkang University, Taiwan) 

Julius Stuller (Institute of Computer Science, Academy of Sciences 
of the Czech Republic, Czech Republic) 

Kazimierz Subieta (Polish Academy of Science, Poland) 

Bernhard Thalheim (Cottbus Technical University, Germany)

Aphrodite Tsalgatidou (University of Athens, Greece) 

Enn Tyugu (Royal Institute of Technology, Sweden) 

Gottfried Vossen (University of Munster, Germany) 

Benkt Wangler (Stockholm University, Sweden) 

Tatjana Welzer Druzovec (Maribor University, Slovenia)

Viacheslav Wolfengagen (Moscow Engineering Physics Institute, Russia)
Vladimir I. Zadorozhny (University of Maryland, USA) 

Alexandr Vasil’evich Zamulin (Institute of Informatics Systems, 

Russian Academy of Sciences, Russia) 



Additional Referees 

Dmitry Briukhov
Marjan Druzovec 
Gintautas Dzemyda 
Dale Dzemydiene 
Vicente Pelechano Ferragud 
Chris A. Freyberg 
Antonio Grau 
Saulius Gudas 
Joanna Jozefowska 
Dimitrios Katsaros 
Christian Koncilia 
Antanas Lipeika 
Audrone Lupeikiene 
Olivera Marjanovic 
Saulius Maskeliunas 



Karl Neumann 
Emilio Insfran Pelozo
Costas Petrou
Ralf Pinger 
Torsten Priebe 
Shazia Sadiq 
George Samaras 

Srinivasan T. Sikkupparbathyam 

Cezary Sobaniec 

Eva Söderström

Jerzy Stefanowski 

Matias Strand 

Eleni Tousidou 

Costas Vassilakis 

Robert Wrembel 

Jian Yang 







Organizing Committee 

Chair 

Olegas Vasilecas (Vilnius Gediminas Technical University, Lithuania) 



Vice-Chairs 

Algimantas Ciucelis (Vilnius Gediminas Technical University, Lithuania) 
Alfredas Otas (Lithuanian Computer Society, Lithuania) 

Rimantas Petrauskas (Lithuanian Law University, Lithuania) 



Members 

Petras Adomenas (Vilnius Gediminas Technical University, Lithuania) 

Danute Burokiene (Institute of Mathematics and Informatics, Lithuania) 

Dale Dzemydiene (Institute of Mathematics and Informatics, Lithuania) 

Milda Garmute (Vilnius Gediminas Technical University, Lithuania) 

Audrius Klevas (Vilnius Gediminas Technical University, Lithuania) 

Kristina Lapin (Vilnius University, Lithuania) 

Audrone Lupeikiene (Institute of Mathematics and Informatics, Lithuania) 
Saulius Maskeliunas (Institute of Mathematics and Informatics, Lithuania) 
Guenter Millahn (Brandenburg University of Technology at Cottbus, Germany) 
Arunas Ribikauskas (Vilnius Gediminas Technical University, Lithuania) 
Danute Vanseviciene (Institute of Mathematics and Informatics, Lithuania) 
Aldona Zaldokiene (Institute of Mathematics and Informatics, Lithuania) 

ADBIS Steering Committee 

Chair 

Leonid Kalinichenko (Russia) 

Members 

Andras Benczur (Hungary) 

Radu Bercaru (Romania) 

Albertas Caplinskas (Lithuania) 

Johann Eder (Austria) 

Janis Eiduks (Latvia) 

Hele-Mai Haav (Estonia) 

Mirjana Ivanovic (Yugoslavia) 

Mikhail Kogalovsky (Russia) 

Yannis Manolopoulos (Greece)



Rainer Manthey (Germany) 
Tadeusz Morzy (Poland) 

Pavol Navrat (Slovakia) 

Boris Novikov (Russia) 

Jaroslav Pokorny (Czech Republic) 
Boris Rachev (Bulgaria) 

Anatoly Stogny (Ukraine) 

Tatjana Welzer (Slovenia) 
Viacheslav Wolfengagen (Russia) 







Sponsoring Institutions 

European Commission, Research DG, Human Potential Programme, 
High-level Scientific Conferences (subject to contract) 

Lithuanian Science and Studies State Foundation 
Microsoft Research Ltd. 




Table of Contents 



Invited Papers 

Ubiquitous Web Applications 1 

Franca Garzotto 

From Workflows to Service Composition in Virtual Enterprises 2 

Marek Rusinkiewicz 

Subject-Oriented Work: Lessons Learned from an Interdisciplinary 

Content Management Project 3 

Joachim W. Schmidt, Hans-Werner Sehring, Michael Skusa,

Axel Wienberg 



Regular Papers 



Query Optimization 

Query Optimization through Removing Dead Subqueries 27 

Jacek Plodzien, Kazimierz Subieta

The Impact of Buffering on Closest Pairs Queries Using R-Trees 41 

Antonio Corral, Michael Vassilakopoulos, Yannis Manolopoulos 

Enhancing an Extensible Query Optimizer with Support for Multiple 

Equivalence Types 55 

Giedrius Slivinskas, Christian S. Jensen 

Multimedia and Multilingual Information Systems 

Information Sources Registration at a Subject Mediator as Compositional 

Development 70 

Dmitry O. Briukhov, Leonid A. Kalinichenko, Nikolay A. Skvortsov 

Extracting Theme Melodies by Using a Graphical Clustering Algorithm 

for Content-Based Music Information Retrieval 84 

Yong-Kyoon Kang, Kyong-I Ku, Yoo-Sung Kim 

A Multilingual Information System Based on Knowledge Representation 98
Catherine Roussey, Sylvie Calabretto, Jean-Marie Pinon 







Spatiotemporal Aspects of Databases 

Capturing Fuzziness and Uncertainty of Spatiotemporal Objects 112 

Dieter Pfoser, Nectaria Tryfona 

Probability-Based Tile Pre-fetching and Cache Replacement Algorithms 

for Web Geographical Information Systems 127 

Yong-Kyoon Kang, Ki-Chang Kim, Yoo-Sung Kim

Data Mining 

Optimizing Pattern Queries for Web Access Logs 141 

Tadeusz Morzy, Marek Wojciechowski, Maciej Zakrzewicz

Ensemble Feature Selection Based on Contextual Merit and Correlation 

Heuristics 155 

Seppo Puuronen, Iryna Skrypnyk, Alexey Tsymbal 

Interactive Constraint-Based Sequential Pattern Mining 169 

Marek Wojciechowski 

Transaction Processing 

Evaluation of a Broadcast Scheduling Algorithm 182 

Murat Karakaya, Özgür Ulusoy

An Architecture for Workflows Interoperability Supporting Electronic 

Commerce 196 

Vlad Ingar Wietrzyk, Makoto Takizawa, Vijay Khandelwal 

Object and Log Management in Temporal Log-Only Object Database 

Systems 210 

Kjetil Nørvåg

Conceptual Modeling. Information Systems Specification 

Operations for Conceptual Schema Manipulation: Definitions 

and Semantics 225 

Helle L. Christensen, Mads L. Haslund, Henrik N. Nielsen,

Nectaria Tryfona 

Object-Oriented Database as a Dynamic System with Implicit State 239 

Kazem Lellahi, Alexandre Zamulin 

The Use of Aggregate and Z Formal Methods for Specification and Analysis 

of Distributed Systems 253 

Henrikas Pranevicius 







Active Databases 

Detecting Termination of Active Database Rules Using Symbolic Model 

Checking 266 

Indrakshi Ray, Indrajit Ray 

Querying Methods 

A Data Model for Flexible Querying 280 

Jaroslav Pokorny, Peter Vojtas 

The Arc-Tree: A Novel Symmetric Access Method for Multidimensional 

Data 294 

Dimitris G. Kapopoulos, Michael Hatzopoulos

Evaluation of Join Strategies for Distributed Mediation 308 

Vanja Josifovski, Timour Katchaounov, Tore Risch 

XML 

An RMM-Based Methodology for Hypermedia Presentation Design 323 

Flavius Frasincar, Geert-Jan Houben, Richard Vdovjak

Efficiently Mapping Integrity Constraints from Relational Database to 

XML Document 338 

Xiaochun Yang, Ge Yu, Guoren Wang 

A Web-Based System for Handling Multidimensional Information through 

MXML 352 

Manolis Gergatsoulis, Yannis Stavrakas, Dimitris Karteris, 

Athina Mouzaki, Dimitris Sterpis 

Information Systems Design 

An Abstract Database Machine for Cost Driven Design of Object-Oriented 

Database Schemas 366 

Joachim Biskup, Ralf Menzel 

Author Index 381 




Ubiquitous Web Applications 



Franca Garzotto 



HOC - Hypermedia Open Center
Department of Electronics and Information 
Politecnico di Milano, Italy 
garzotto@elet.polimi.it



Abstract. Web sites are progressively evolving from browsable, read- 
only information repositories that exploit the web to interact with their 
users, to web-based applications, combining navigation and search capa- 
bilities with operations and transactions typical of information systems. 
In parallel, the possibility of accessing web-based contents and services 
through a number of different devices, ranging from full-fledged desktop 
computers, to Personal Digital Assistants (PDA’s), to mobile phones, to 
set-top boxes connected to TV’s, makes web applications ubiquitous, i.e.,
accessible anywhere at any time. 

Ubiquitous Web Applications (UWA’s for short) reveal a number of aspects
which make them different from conventional data-intensive applications,
and which must be taken into account throughout the
whole application lifecycle, from requirements to implementation. UWA’s 
are executed in a Web-based environment, where the paradigm for pre- 
senting and accessing information is hypermedia-like. Thus UWA’s have 
a mixed nature - hypermedia and transactional, where hypertext struc- 
tures and operation capabilities are strongly intertwined. In addition, the 
ubiquitous nature of a UWA implies that the application has to take into 
account the different constraints of different devices, comprising display 
size, local storage size, method of input and computing speed as well as 
network capacity. At the same time, ubiquity introduces new require- 
ments on how the application tunes itself to the end user: each user 
may wish to get information, navigation patterns, lay-out, and services, 
that are tailored not only to his/her specific profile but also to the cur- 
rent situation of use, in its temporal and environmental aspects. Thus 
Ubiquitous Web Applications must be at the same time device-aware, 
user-aware, and context-of-use-aware, and require sophisticated forms of
customization. 

After an analysis of the novel requirements of UWA’s, this talk will focus 
on their impact on the design process, and will discuss problems and 
challenges related to modeling information and navigation structures, 
operations and transactions, and customization mechanisms for this class 
of applications. 



A. Caplinskas and J. Eder (Eds.): ADBIS 2001, LNCS 2151, p. 1, 2001. 
© Springer-Verlag Berlin Heidelberg 2001




From Workflows to Service Composition in 
Virtual Enterprises 



Marek Rusinkiewicz 

Information and Computer Science Research Laboratory 
Telcordia Technologies 
Austin, TX, USA 
marek@research.telcordia.com



Abstract. Workflow technologies have been used extensively to support 
process-based integration of activities within enterprises and are at the 
core of emerging Enterprise Integration Platform (EIP) technologies. Re- 
cently, a new abstraction of electronic services has begun to receive a lot 
of attention among researchers and in the vendor community. Electronic 
services provide the basis for creation of virtual enterprises (VE), which 
combine services from multiple providers. In this talk, we will discuss the 
advances that are needed to provide support for VEs. These include the 
ability to advertise, broker, synchronize and optimize the services. We 
will discuss the infrastructure needed to support VE applications, show
a prototype demo, and list important research problems that need to be 
solved. 



A. Caplinskas and J. Eder (Eds.): ADBIS 2001, LNCS 2151, p. 2, 2001. 
© Springer-Verlag Berlin Heidelberg 2001




Subject-Oriented Work: Lessons Learned from an 
Interdisciplinary Content Management Project 



Joachim W. Schmidt, Hans-Werner Sehring, Michael Skusa, and Axel Wienberg 

Technical University TUHH Hamburg, Software Systems Institute, 

Harburger Schloßstraße 20, D-21073 Hamburg, Germany
{j.w.schmidt, hw.sehring, skusa, ax.wienberg}@tuhh.de



Abstract. The two broad cases, data- and content-based applications, differ 
substantially in the fact that data case applications are abstracted first before 
they cross any system boundary while for content cases it is the system itself 
which has to map application content into some data-based technology. 
Through application analysis and software design we are aware of the difficul- 
ties of such mappings. In an interdisciplinary project with our Art History col- 
leagues who are working in the subject area of “Political Iconography” we are 
gaining substantial insight into their Subject-Oriented Working (SOWing) 
needs and into initial requirements for a SOWing environment. In this paper we 
outline the project, its basic models, their generalization as well as our initial 
experiences with prototypical SOWing implementations. We emphasize the
conceptual and terminological aspects of our approach, sketch some of the 
technical requirements of a generic SOWing software platform and relate our 
work to various XML-based activities. 



1 Introduction 

As a result of advanced and extensible database technology now being available as 
off-the-shelf products, a substantial part of database research and development work 
has generalized into work on models and systems for multimedia content management. 
R&D in content management includes a range of models and systems concentrating on 
services for the following three lines of work: 

- content production and publication work using multimedia documents; 

- classification and retrieval work based on document content; 

- management and control of such work for communities of users differentiated by 
their roles and rights, interests and profiles, etc. 

The work reported in this paper is based on an interdisciplinary project with a partner 
from the humanities with strong semantic and weak formal commitment. Our project 
partner specializes in icon- and text-based content from the subject area of
“Political Iconography”. This content is organized as a paper- and drawer-based sub- 
ject index (PI-“Bildindex”, BPI, see Fig. 1) and is used for Art History research and 
education [33]. 

Iconographic work has a long tradition in Art History, dating back to 19th-century
“Christian Iconography”, and is based on integrated experience from three sources: 



A. Caplinskas and J. Eder (Eds.): ADBIS 2001, LNCS 2151, pp. 3-26, 2001. 
© Springer-Verlag Berlin Heidelberg 2001







- Art: multimedia content; 

- Art History: process knowledge; 

- Library Sciences: subject-oriented content classification and retrieval. 

In our context we use the notion of subject-orientation very much in the sense of li- 
brary science, as, for example, stated by Elaine Svenonius: “In a subject language the 
extension of a term is the class of all documents about what the term denotes, such as 
all documents about butterflies.” This understanding differs substantially from natural 
language where “the extension, or extensional meaning, of a word is the class of entities
denoted by that word, such as the class consisting of all butterflies” [30]. And
both understandings are in clear contrast to the semantics of terms in programming 
languages and database models. 
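
The difference between the two readings can be illustrated with a minimal sketch. It rests on assumptions: the entity names, document titles, and function names below are hypothetical sample data, not taken from the BPI or from [30]. In the natural-language reading, the extension of “butterfly” is a set of entities; in the subject-language reading, it is the set of documents about butterflies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    title: str
    subjects: frozenset  # subject terms this document is about


# Hypothetical sample data, invented for illustration only.
entities = {"monarch": "butterfly", "swallowtail": "butterfly", "honeybee": "bee"}
documents = [
    Document("Field Guide to Butterflies", frozenset({"butterfly"})),
    Document("Pollinators of Europe", frozenset({"butterfly", "bee"})),
    Document("Beekeeping Basics", frozenset({"bee"})),
]


def natural_language_extension(word):
    """Class of entities denoted by the word."""
    return {name for name, kind in entities.items() if kind == word}


def subject_language_extension(term):
    """Class of documents about what the term denotes."""
    return {doc.title for doc in documents if term in doc.subjects}
```

Calling both functions with the same term returns sets of different kinds of things (entity names versus document titles), which is the point of the distinction drawn above.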

The interdisciplinary project “Warburg Electronic Library (WEL)” models and 
computerizes BPI content and services [3], [19], and the WEL prototype allows inter- 
disciplinary experiments and insights into multimedia content management and appli- 
cations. 

The overall goal of the WEL project is 

- the generalization of our subject-oriented working experience, 

- a work plan for R&D in subject-oriented content management, and 

- a generic Subject-Oriented Working environment (SOWing environment).

Currently, many contributions to such R&D are based on XML as a syntactic framework
which provides a structural basis as well as some form of implementation platform.
The main reasons for XML’s powerful position are its strong structural commitment
and its semantic neutrality.

Successful content management requires that the three lines of work 

- content production and publication work by multimedia documents; 

- classification and retrieval work based on document content; 

- management and control of such work for communities of users differentiated by 
their roles and rights, profiles and interests, etc. [21]

are not supported in isolation but in a coherent and cooperative working environment 
covering the entire space spanned by all three dimensions. 

The main reason why XML-based work on content management often falls short 
can be stated as a corollary of XML’s strength (see above): its weak semantic and 
exclusively structural commitment. Much of the XML-based R&D contributes to the 
above three lines of work only individually. Examples include [7], [31].

The paper is structured as follows. Section 2 introduces the two projects involved, 
the “Bildindex für Politische Ikonographie (BPI)” and the “Warburg Electronic
Library (WEL)”. In Sect. 3 the WEL model is generalized towards a generic “Work
Explication Language”. A system’s view of the WEL prototype is described in Sect. 4 
and the first contours of a generic Subject-Oriented Working environment (SOWing 
platform) are outlined. Related work, in particular work in the XML context, is dis- 
cussed in Sect. 5. The paper concludes with a short summary and a reference to future 
work in our long-term SOWing project. 





2 Warburg Electronic Library: An Interdisciplinary Content 
Management Project 

The development of the currently predominant data management models was heavily 
influenced by application requirements from the business and banking world and their 
bookkeeping experience: the concepts of record, tabular view, transaction etc. are 
obvious examples. Data model development had to go through several generations - 
record-based file management, hierarchical databases and network models - until the 
relational data model reached a widely accepted level of abstraction for database 
structuring and content-based data operation. 

For traditional relational data management we basically assume that content is 
“values of quantified variables” from business domains operated by transactions and 
laid out as tables with rows and columns. The questions arise: 

- How can we generalize from data management to the area of content management 
where content domains are not “just dates and dollars”, content operation goes be- 
yond “debit-credit transactions” and content layout means multimedia documents? 

- What are key application areas beyond bookkeeping which help us understand, 
conceptualize and finally implement the core set of requirements for multimedia 
content management in terms of domain modelling, content-oriented work support 
as well as content (re-) presentation? 




Fig. 1. Working with the Index for Political Iconography in the Warburg-Haus, Hamburg 



The work presented here is based on an interdisciplinary R&D-project between 
Computer Science and Art History, the “Warburg Electronic Library” project. The 






application area was chosen because of Art History’s long-term working experience 
with content of various media. The project itself is founded on extensive material and 
user experience from the area of “Political Iconography”. 



2.1 Subject-Oriented Work in Political Iconography 

Political iconography basically intends to capture the semantics of key concepts of the 
political realm under the assumption that political goals, roles, values, means etc. 
require mass communication which is implemented by the iconographic use of images.
Our partner project in Art History, the “Bildindex zur Politischen Ikonographie
(BPI)”, was initiated in 1982 by the Art Historian Martin Warnke [33] and consists of 
roughly 1,500 named political concepts (subject terms, “Schlagworte”) and more than
300,000 records on iconographic works relevant to the BPI. In 1990 Warnke’s work
was awarded the Leibniz-Preis, one of the most prestigious research grants in Ger- 
many. 




[Figure 2: image card annotated with three attribute groups]
properties of work as project: <origin: <France, >; artist: anonymous>
properties of work as product: <title: St Moritz; media: book painting; archiving: London ...; provenance: postcard>
conceptual properties of work: motives: <..., lance, flag, ...>

Fig. 2. BPI “Bildkarte” St. Moritz (image card) describing art work by attribute aggregation 



Starting with this experience, BPI work essentially relies on an Art Historian’s 
knowledge of (documents referring to) political acts in which images play an active
role. Art Historians interpret “acts” as encompassing aspects of 








- “projects” (who initiated and contributed to an act? the when and where of an act? 
etc.); 

- “products” (what piece of art did the project produce? on what medium? place of 
current residence etc.); and finally, the 

- “concepts” behind the act (what political goals, roles, institutions etc. are ad- 
dressed? what iconographic means are used by the artist? etc.). 

On this knowledge level, BPI work identifies political concepts and names them 
individually by subject terms - e.g., by “ruler”, “prince”, “pope”, “equestrian 
statue”. 

Subject term semantics is methodologically captured and systematically repre- 
sented in the BPI by the following steps: 

1. designing a conceptual, prototypical and representative (mostly mental) model for 
each subject term, e.g., a prototypical equestrian statue [29];

2. giving value to the relevant variables or facets of such prototypes by reference to 
the Art Historian’s knowledge of “good cases”, i.e., political acts with an icono- 
graphic dimension. Each such variable or facet is represented by a BPI entry 
(“Bildkarte”, “Textkarte”, “Videokarte” - “media card” etc.) which holds a
description of a “good case” for that facet; see, for example, St. Moritz in Fig. 2;

3. collecting all BPI entries on the same prototype into a single extent (“Bildkarten- 
stapel”, ..., “stack of media cards”, see Fig. 3) thus defining the semantics of a 
subject term. Additional fine structure may be imposed on subject term extents 
(order, “neighborhood”, named subextents, general association/navigation etc.); 

4. maintaining a (“completion”) process aiming at a “best possible” definition of the 
subject area at hand by 

   - “representative” subject terms covering the subject area at hand;
   - “qualifying” prototypes for each subject term;
   - “complete” sets of facets for prototype description;
   - “good” cases for facet substantiation.

This makes it quite clear that the BPI is by no means just an index for accessing an 
image repository. The BPI uses images only in their rather specific role as icons and 
for the specific purpose of contributing to the description of cases and thus to the 
semantics of subject terms [32]. In this sense, images represent the iconographic vo- 
cabulary of BPI documents just as keywords contribute to the linguistic vocabulary of 
text documents. 
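
The four steps above can be sketched as a small data model. This is only an illustration under stated assumptions, not the WEL implementation: the class and method names (`MediaCard`, `SubjectTerm`, `cards_by_medium`, `motives`) are hypothetical, the attribute values are the ones legible in Fig. 2, and filing the St. Moritz card under “equestrian statue” is assumed for the sake of the example. A media card aggregates the three attribute groups of a “good case”, and the extent of a subject term is the stack of cards collected for it.

```python
from dataclasses import dataclass, field


@dataclass
class MediaCard:
    """One BPI entry (image, text, or video card) describing a "good case"."""
    medium: str                                    # e.g. "image", "text", "video"
    project: dict = field(default_factory=dict)    # properties of work as project
    product: dict = field(default_factory=dict)    # properties of work as product
    concepts: dict = field(default_factory=dict)   # conceptual properties of work


@dataclass
class SubjectTerm:
    """A named political concept whose semantics is given by its extent."""
    name: str
    extent: list = field(default_factory=list)     # stack of media cards

    def add_card(self, card):
        self.extent.append(card)

    def cards_by_medium(self, medium):
        """Named subextent: all cards of one medium."""
        return [card for card in self.extent if card.medium == medium]

    def motives(self):
        """Iconographic vocabulary contributed by all cards in the extent."""
        found = set()
        for card in self.extent:
            found.update(card.concepts.get("motives", []))
        return found


# Attribute values as legible in Fig. 2; the term assignment is an assumption.
st_moritz = MediaCard(
    medium="image",
    project={"origin": "France", "artist": "anonymous"},
    product={"title": "St Moritz", "media": "book painting", "provenance": "postcard"},
    concepts={"motives": ["lance", "flag"]},
)
equestrian = SubjectTerm("equestrian statue")
equestrian.add_card(st_moritz)
```

Keeping the extent as an ordered list rather than a set leaves room for the additional fine structure mentioned in step 3 (order, neighborhood, named subextents).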








[Text card shown in Fig. 3, translated from German: “St. Moritz (St. Maurice) has, since the end of the 11th century, preferably been depicted with armor, shield and lance, often riding a white horse.”]




Fig. 3. BPI subject term semantics (e.g. equestrian statue) by media card classification (image, 
text, video cards etc.) 

Art Historians, with their long tradition of working with content represented by multiple
media, are far from restricting themselves to a mainly technical view on multimedia,
as most of the currently booming projects in Computer Science seem to do. Our
Art History colleagues are much closer to the message of people such as Marshall
McLuhan who understand media as “extensions of man” [18].

The BPI has essentially two groups of users: 

- a few highly experienced BPI editors for content maintenance and 

- various broader user communities which access BPI content for research and edu- 
cation purposes. 

Being implemented on paper technology, the traditional BPI shows severe conceptual 
and technical shortcomings: 

- conceptually: the above attributes “representative”, “good”, “complete”, etc. are
highly subjective and, therefore, “completion semantics” is hard to meet even
within a “single-person-owned” subject index;

- technically: severe representational limitations are obvious and range from single 
subsumption of BPI entries to a lack of online and networked BPI access. 

In the subsequent section we outline two contributions of the “Warburg Electronic
Library” (WEL) project, which addresses the above conceptual and technical shortcomings
through an advanced Digital Library that, as its prime application, now hosts
the “Index for Political Iconography”.





Subject-Oriented Work: Lessons Learned from a Content Management Project 9 



2.2 A Subject-Oriented Working Environment: Warburg Electronic Library 

Viewed from our Computer Science perspective, which shifted in recent years from
basic research in “persistent database programming” towards R&D in “software systems
for content management” (online, multimedia, ...), the WEL project addresses a
range of highly relevant and interrelated content application issues:

- content representation by multiple media: images, texts, data, ...;

- content structuring, navigation and querying; content presentation; 

- content work exploiting subjects and ontologies: classification, indexing, ...; 

- utilization of different referencing mechanisms: icon, index, symbol; 

- cooperative projects on multimedia content in research and education. 

The WEL is an interdisciplinary project between the Art History department of Ham- 
burg University (Research Group on Political Iconography, Warburg-Haus, Hamburg) 
and the Software Systems Institute of the Technical University, TUHH, Hamburg. It 
began in 1996 as a 5-year project and will be extended into an interdisciplinary R&D
framework involving several Hamburg-based institutions.

For a short WEL overview we will concentrate on two project contributions: 

- semantic modelling principles for WEL-design; 

- personalized digital WEL libraries based on project-specific prototypes and their 
use in Art History education. 



WEL Semantic Modelling Principles. The WEL design is based - as is already the 
BPI design - on the classical semantic data modelling principles [28], [6]: 
aggregation, classification, generalization / specialization and association / navigation
(see Figs. 2 and 3).




Fig. 4. Media card associated with (multiple) subject terms and with information on classifica- 
tion work 

However, it is important to note that the semantics of subject classes and their en- 
tries originate from different semantic sources and, therefore, go beyond classical data 
modelling (see also Sect. 4.2): 









- object semantics: seen from a data modelling point of view, subject class entries are
also entities of some object classes in the sense of object-oriented modelling. How- 
ever, a subject class extent may be heterogeneous because its entries may describe 
documents of different media - texts, images, videos etc. Therefore, subject class 
entries viewed as objects may belong to different object classes - text, image, video 
classes etc. 

- content semantics: furthermore, all content described by the entries of the same
subject class shares some semantic key elements. All BPI documents referring, for 
example, to the subject class “ruler” make use of graphical textual key icons such 
as swords, crowns, scepters, horses etc.; similarly, the text documents associated 
with a certain subject class contain overlapping sets of subject-related keywords. 
Such sets of key icons capture essential parts of content and, thus, of subject term 
semantics. Note that specialization of subject classes goes along with extension and 
union of subject-related key icon sets while generalization relies on reduction and 
intersection. 

- completion semantics: in Sect. 2.1 we referred to the soft semantic constraint of 
achieving best possible subject definition as “completion semantics”. Although this 
may be considered more as an issue of class pragmatics than class semantics it im- 
plies formal constraints on subject class extents. Since users of subject definitions 
act under the assumption that the subject owner established a subject extent which 
represents all relevant aspects of the owner’s subject prototype, any change of that
extent is primarily monotonic, i.e., extents of subject terms are only changed by
adding or replacing their entries. Therefore, references to subject class entries should
not become invalid. 
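
The monotonicity constraint of completion semantics can be illustrated by a minimal Python sketch. The class `SubjectExtent` and its methods are hypothetical and not part of the WEL implementation; the sketch only assumes that entries are addressed by stable identifiers, that adding and replacing are the only offered changes, and that no removal operation exists.

```python
from dataclasses import dataclass, field


@dataclass
class SubjectExtent:
    """Hypothetical extent of a subject term that is only changed monotonically."""
    term: str
    _entries: dict = field(default_factory=dict, repr=False)  # entry id -> description
    _next_id: int = 1

    def add(self, description: str) -> int:
        """Add a new entry and return a reference that stays valid afterwards."""
        entry_id = self._next_id
        self._entries[entry_id] = description
        self._next_id += 1
        return entry_id

    def replace(self, entry_id: int, description: str) -> None:
        """Replace the content of an existing entry; its reference is kept."""
        if entry_id not in self._entries:
            raise KeyError(f"unknown entry reference: {entry_id}")
        self._entries[entry_id] = description

    def resolve(self, entry_id: int) -> str:
        """Look up an entry by a previously issued reference."""
        return self._entries[entry_id]

    # Deliberately no remove(): references handed out earlier never dangle.


ruler = SubjectExtent("ruler")
ref = ruler.add("first description of a media card")
ruler.replace(ref, "revised description of the same media card")
```

Because the identifier returned by `add` survives every `replace`, a reader holding `ref` can still resolve it after the owner has revised the extent.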

Figure 4 shows a media card for St. Moritz together with the (multiple) subject terms 
to which St. Moritz contributes. The example also links St. Moritz to details of the 
classification process by which this card entered the WEL. This information is essential
for the realization of project-oriented views on subject terms, or reference libraries:

- thematic views (customization): projects usually concentrate on sub-areas of the
all-encompassing “Index for the Political Iconography”;

- personalized views (personalization): they cope with the conceptual problems with
“completion semantics” mentioned above. 

Initial experiences with both of these viewing mechanisms are outlined in the subse- 
quent section. 







Subject-Oriented Work in Art History Education. A key experience of the WEL
project relates to the two dimensions of subject-oriented work: subject-orientation as a
thematic view (customization) and as an individualized view (personalization).
Speaking in terms of Digital Libraries, both dimensions are approached by the WEL
concept of “reference libraries” (“Handbibliothek”), which are essentially SOWing
environments customized and personalized according to the requirements of
individual projects or persons [17].




Fig. 5. Thematic and personalized subject views for the “Mantua” seminar 

Figure 5 outlines the use of customized and personalized SOWing environments in 
an Art History seminar on “Mantua and the Gonzaga” [26]. The general BPI 
(“Warnke-owned”) is first customized into a subject index for the “Mantua and the 
Gonzaga” seminar project. The main objective of the individual student projects in 
that seminar is to further personalize the seminar index, structurally and content-wise, 
and to produce, for example, a project-specific subject index for topics such as “Stu- 
diolo” or “Camera picta” [20]. Publicizing the final subject content in some form of 
media document (see Fig. 6) - traditional print report or interactive website - consti- 
tutes another seminar objective [34]. 




[Diagram labels: reference library (student group AG-3); structured as; publicized as; multimedia document]




Fig. 6. The “Studiolo” subject index publicized as a multimedia document 



3 Towards a Generalized WEL-Model for Subject-Oriented Work 

The prime experience gained from our interdisciplinary WEL project is a deeper insight
into the intertwining of SOWing entities and their working relationships. Conceptually,
the services of our SOWing platform are primarily based on four kinds of
entities which are straightforward generalizations of the corresponding WEL entities:

- “work cases” as the basic abstraction of the acts and entities of interest in a domain; 

- “case documents” which are abstract or physical entities reporting on such work; 

- “case entries” which record the essence of work case documents; 

- “subject terms” which, based on such case entries, structure the domain, define its 
semantics, and give access to its documents and works. 

Since all four notions are quite generic, our SOWing project has ample room to
generalize the models and systems for subject-oriented work far beyond our
initial WEL approach.

In the subsequent sections we will outline an extended SOWing model and plat- 
form and re-interpret the acronym WEL from “Warburg Electronic Library” to “Work 
Explication Language”, very much in the sense of the definition of ontology as “a 
theory regarding entities, especially abstract entities to be admitted into a language of 
description” [35]. 

Subsequently, we discuss the four kinds of SOWing entities in more detail. 







3.1 On a Generic Notion of “Work” 

Central to the content management support provided by our SOWing platform is a 
generic notion of “work”. Our work concept is based on the WEL experience and is 
characterized and modeled by three groups of properties (see also Fig. 2): 

- work as a “project”, i.e., the circumstances under which work is performed; 

- work as a “product”, i.e., a work’s result; and 

- work as a “concept”, i.e., the conceptual idea behind a work. 

Figure 7 relates these three work characteristics using a “work triangle” diagram. The 
generalization of WEL work examples such as the one given by Fig. 2 is obvious. 

The upper part of Fig. 8 depicts a second work case, similarly structured but rather 
different in nature. Fig. 8.2 depicts a WEL work case done by a Mr. B. when produc- 
ing a work document description of the Gonzaga-Mantegna-Minerva work case and 
entering it into the WEL. While there may be only partial knowledge on the renais- 
sance work case - essentially only Mantegna’s picture survived - the WEL work case 
being supported and observed by the SOWing platform can receive an arbitrarily 
extensive SOWing coverage. 

This reflective capability is probably the most powerful and unique aspect of our 
SOWing approach. Reflection provides the basis for a wide range of services for cus- 
tomization and personalization, self-description and profiling and for all kinds of 
guiding and tracing support [27]. 



[Diagram: work triangle connecting “concept”, “project” and “product”, with the labels “implementation” and “abstraction”]

Fig. 7. WEL work structure (work triangle) 

Examples of WEL work are presented in Fig. 8. The lower part (Fig. 8.1) models a
specific work case in which a member of the Gonzaga family residing in Mantua
during the 15th century asked the artist Mantegna for a painting addressing the issue of
virtues and sins. Mantegna chose the goddess Minerva as the central motif and presented
her expelling sin from the garden of virtue. This 15th century work case
may be reported by some publicized work document; most probably, however, the
case is just part of an Art Historian’s body of knowledge about the Italian renaissance.
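
The three groups of work properties can be pictured as a small record type. The following Python sketch is an illustration only: the class `WorkCase`, its field names and the `known_dimensions` helper are assumptions made here, and the field values merely paraphrase the Gonzaga-Mantegna-Minerva case described in the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WorkCase:
    """Hypothetical record of one work case along the three work dimensions."""
    project: Optional[str] = None  # circumstances under which the work was performed
    product: Optional[str] = None  # the work's result
    concept: Optional[str] = None  # the conceptual idea behind the work

    def known_dimensions(self) -> List[str]:
        """Names of the dimensions for which knowledge is recorded."""
        return [name for name in ("project", "product", "concept")
                if getattr(self, name) is not None]


# A fully described case, paraphrasing the example in the text.
minerva_case = WorkCase(
    project="Gonzaga family member in Mantua commissions Mantegna (15th century)",
    product="painting by Mantegna",
    concept="Minerva expelling sin from the garden of virtue",
)

# A partially known case: only the product survived.
surviving_only = WorkCase(product="painting by Mantegna")
```

Leaving dimensions optional reflects the observation that a work case may be known only in part, for instance by its surviving product.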








3.2 Work Case Documents 

Work such as the Gonzaga-Mantegna-Minerva case of Fig. 8 is documented typically 
in narrative form and presented by some multi-media documents - texts, images, 
speech etc. and combinations thereof. In our SOWing approach such documents are 
assumed to represent content in terms of the above three dimensions: work project, 
work product and the concepts behind both (see also Fig. 7). This view on documents 
is quite general and allows interpretation ranging from the rather informal but very 
expressive documents of “Political Iconography” to partially formalized diagrams 
such as flow charts and UML entities and to computer programs and operator instruc- 
tions with a fully formalized semantics. 



[Diagram labels: description work (Mr. B describing the Gonzaga/Mantegna/Minerva case, with entry fields “nobility: (on Gonzaga)”, “artist: (on Mantegna)”, “icon: ...”); production work (Mantegna painting Minerva as virtue for Gonzaga on the occasion of ...)]



Fig. 8. WEL work graph: production work (Fig. 8.1) and description work (Fig. 8.2) 

Work documents may also vary in terms of their completeness in the sense that the 
work they report may be known only partially, for example, by its product. Many 
examples can be found in Art History where most of the project knowledge is usually 
lost and only the image survives (the opposite case also exists). Nevertheless, we 
agree with our colleagues from Art History that products should never be considered 
in isolation but always be recorded in the context of the project (persons, time, place, 
tools, etc.) for which they were created. For documents produced within a computer- 
ized environment this has partially become standard although there is no overall con- 
cept of what to do with this information. 

Note the dialectic character of our SOWing position on this point: on the one hand,
work cases are assumed to be reported by documents; on the other hand, all
documents - at least the ones produced by the SOWing environment and its tools -
are considered as work cases and are, yet again, reported by work case documents and
entries.







3.3 Work Case Entries 

In the SOWing approach work case documents are described and recorded by work 
case entries which establish the relationship between subject terms and their semantics 
on one side and the work cases and their documentation on the other. Such case en- 
tries generalize the media cards of the WEL system. 

Conceptually there is a third relationship involved which associates a work case en- 
try with a class of icons considered “representative” for the kind of work which is 
described by the case. We use the term “icon” here in the sense of iconic signifier (as 
opposed to indexical or symbolic signifiers, see, for example, [9]). For image docu- 
ments icons specialize to iconographic signifiers, for text documents to keywords 
(“Stichworte”) etc. 

A case entry on an image and text document with content on the subject term 
“ruler”, for example, will for its image part be described by characteristic icons from a 
subject-specific icon set {crown, sword, scepter, ...}, for its textual part by a corre- 
sponding set of keywords. 

Seen from an object point of view, case entries are instances of media-specific ob- 
ject classes (image, text, video classes etc.) while from a content perspective they 
draw constraints from (hierarchies of) icon classes. Finally, from a subject-oriented 
position, case entries become members of the extent of some subject classes (our old 
WEL stack of media cards) thus contributing to the definition of their semantics. 

In Sect. 4 we will relate and discuss these three perspectives in terms of class dia- 
grams (Fig. 10). 



3.4 Subject Definition Work 

Subject term semantics is essentially defined extensionally by document descriptions.
Intensionally, their semantics is captured in part by icon classes shared by such extents.
Both extensional and intensional semantics are related by the fact that each entry in
the subject term’s extent shows a characteristic profile over the icon class related to its
subject term. A specific image document contributing to the semantics of “ruler” will
not display the entire icon class for rulers, i.e., the set {crown, sword, scepter, ...}, but
a characteristic subset of it, probably a crown and a sword in a prominent position
within the image.
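
The relation between an icon class and the profile of a single entry can be sketched in a few lines of Python. The function `profile` and the extra icon “landscape” are hypothetical; the icon class lists only the three icons named in the text and leaves the elided remainder out.

```python
# Icon class for "ruler", restricted to the icons named in the text.
RULER_ICONS = frozenset({"crown", "sword", "scepter"})


def profile(entry_icons, icon_class):
    """Return the part of the icon class that an entry actually displays."""
    return frozenset(entry_icons) & icon_class


# Icons assigned to one image document (illustrative values).
image_icons = {"crown", "sword", "landscape"}
ruler_profile = profile(image_icons, RULER_ICONS)
```

Here `ruler_profile` contains the crown and the sword only, i.e., a proper subset of the icon class, which is the situation the paragraph above describes.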

Most of the production work on which the BPI is based typically took place outside 
a computerized environment - usually the referenced iconographic work dates back 
several centuries. Production work may, however, also mean the production of docu- 
ments about the original iconographic work and such document work may well con- 
tribute to or profit from a SOWing environment. 

Description support, being definitely a matter of the SOWing platform, may, for example,
provide references to other subject terms covered by the index [22]. For the
Gonzaga-Mantegna-Minerva case it may be quite enlightening to capture the reason
why Gonzaga ordered that picture by referring to some subject term “virtue” or its
generalization “political objectives” to which Machiavelli’s work “Il Principe”, a
successful handbook for renaissance rulers, contributed.




[Diagram labels: subject work (subject terms); description work (meta data); production work; subject references between the levels]




Fig. 9. An overview of Subject-Oriented Work 

Figure 9 gives an overview of the subject-oriented work and its support through a 
SOWing platform. It relates production and description work to subject work. 

As mentioned above, a major group of SOWing services is based on the fact that 
“work is a first class citizen” in the SOWing world. This implies that work while being 
supported by the system is automatically identified, described and associated with 
work-related subject terms (project time and participants, product media and archives 
etc.). Evaluating such terms provides the basis for substantially improved work sup- 
port, e.g., work session management, work distribution, protection, personalization 
etc. 



4 The SOWing System 

Our field studies provided us with a good basis for the user requirements analysis of our
SOWing approach, and our extensive prototyping experience gave us a deeper
insight into the architectural and functional alternatives of SOWing system design and
implementation.







4.1 WEL Prototype Experience 

The WEL prototypes developed so far are the major source of experience on which 
the SOWing approach is based. The initial version was based on the Tycoon-2 persis- 
tent programming environment, an object-oriented, higher-order programming lan- 
guage with orthogonal persistence [16], [23]. Using a Tycoon-based acquisition and 
cataloging tool the Computer Scientists, Art Historians and many students from both 
departments digitized many thousands of images and transformed file cards from their 
physical representation into a digital one. This work is still in progress. The descrip- 
tive data in the card catalogue were entered and revised by Art Historians via a web- 
based editor. 

With the evolution of Java from a small object-oriented programming language for 
embedded devices to a mainstream programming language for networked applications 
the WEL system was moved to Java-related technology. 

The current WEL version is based on an early version of a commercial Content 
Management System (CMS) [5] which is entirely Java-based and resides on top of a 
relational database management system. 

For our current prototype we were particularly interested in understanding to what 
extent features of commercial CMS technology meet the requirements of our SOWing 
platform. It turned out that, although the CMS provided several abstractions useful for
our SOWing data model, important relationships between and inside the subject
classes could not be treated as first-class objects. Furthermore, the modeling facilities
of commercial CMSs are not yet rich enough to meet the needs of our generic SOW- 
ing approach. Commercial CMS technology concentrates on specific application do- 
mains with editorial processes for more or less isolated units of work (e.g., news arti- 
cles) with little association to other entities. SOWing requires, however, in addition to 
storing and retrieving the document itself an extended functionality for the embedding 
of documents into the SOWing context with all its relationships. 

Commercial CMS and DBMS technology supports only the specific set of naviga- 
tion methods predominant in their main application domain, and our prototype ran 
into performance problems as soon as users left those default navigation paths. As a 
consequence, we met the domain specific access requirements by introducing an addi- 
tional application layer on top of the CMS. 

Several graphical web-based user interfaces as well as GUI editors were developed 
to enable various classes of users to browse through the subject index, create personal 
indices and collect references to documents accessible via the global index. 

In parallel to the Art History project the generic WEL system was adopted to main- 
tain an index created for the management of concepts relevant in advertising. This was 
and still is carried out in cooperation with a commercial agency from the advertising 
industry. 



4.2 Subject Classes, Object Classes, and Icon Classes 

In Sect. 3 we introduced the four kinds of SOWing entities: work cases, case documents,
case entries, and subject terms. While the former two (Sects. 3.1 and 3.2) are
highly relevant for the conceptual foundation of our SOWing approach, the latter two
(Sects. 3.3 and 3.4) are also important for system implementation.

Case entries are related to subject terms by classification relationships. Subject 
terms are structured in a hierarchy - the subject index - made up by subject term gen- 
eralization and specialization. 

As described in Sect. 2.1 the relationship between case entries and subject terms 
has a double meaning: 

- on the one hand, documents are classified by binding their case entries to subject
terms;

- on the other hand, each document contributes to the subject term’s definition. 

The set of entries chosen to define a subject is supposed to be minimal; only entries 
which introduce a new and relevant facet of the prototypical extent are added. In this 
way subject indices reflect the domain knowledge from the perspective of the owner 
of the subject index. 

Subject indices are not only used to capture “primary knowledge” of domains but 
also “secondary knowledge” as, for example, 

- on the organization and history of a community, including knowledge of users,
rights granted to them, etc. [8];

- on the layout and handling of documents, i.e., properties not directly related to their 
content, e.g., (kind of) origin, document types, quality, etc. 

This leads to different types of subject indices some of which are used by the SOWing 
system itself, e.g., user classification to handle project-specific access rights. 

Since entries can contribute to more than one subject term definition, they may belong
to more than one subject extent, possibly in different indices of the same or different
types. The semantics of multiple subsumption varies in each of the cases.

Several contributions define the semantics of the instantiation relationship: 

- object semantics: technically, descriptions have a type (in the sense of a data type); 

- content semantics: in addition to the attributes with object semantics, the use of 
icons for content description determines a semantic type; the more special a subject 
becomes the more icons the entries in its extent are expected to have; 

- completion semantics: the extent of a subject term is supposed to fully describe its 
semantics (at the current time, for the user who created it). 

It is interesting to see how in our design of SOWing entities traditional elements of 
object-oriented classes coexist with novel aspects of subject-oriented modelling. This 
coexistence of object- and subject-oriented semantic elements is illustrated by the 
diagram in Fig. 10 which is based on an extended UML notation. 

The upper right third of the diagram shows a traditional class hierarchy of an ob- 
ject-oriented model. On top is the class “Object” representing the root of the class 
hierarchy. From this a class “Case Entry” is derived which has attributes for the fea- 
tures which all kinds of case entries share. Subclasses of “Case Entry” are introduced 
for each media type. These classes might introduce further features, e.g., painter for 
images, author for texts. Instances (“entry” in the diagram) of such classes are con- 
structed in the usual object-oriented manner: the object is created for a given class, so 
that its structure is known for its whole lifetime. 








Fig. 10. Subject, object, and icon classes 

Important for the SOWing model is the special attribute “icons” which is defined 
by “Case Entry” and which represents references to an icon class. Icon classes de- 
scribe sets of icons which relate to subject terms. Icons reflect the content of the de- 
scribed document, e.g. keywords in a text or symbols in an image. As indicated in the 
upper left of the diagram (Fig. 10), icon classes are ordered in a specialization rela- 
tionship which is derived from the inclusion of the icon sets. We use the filled-in ar- 
row head to visualize icon class specialization and a thick arrow head for icon class 
“instantiation”. 
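
Since icon class specialization is derived from the inclusion of icon sets, it can be expressed directly with set operations. The sketch below is a hedged illustration: the function names and the icon sets for “ruler” and “equestrian” are assumptions chosen for the example. It follows the rule stated in Sect. 2.2 that specialization goes along with the union of icon sets and generalization with their intersection.

```python
def is_specialization(special, general):
    """True if the special icon class includes every icon of the general one."""
    return frozenset(general) <= frozenset(special)


def specialize(*icon_classes):
    """Icon set of a common specialization: the union of the given icon sets."""
    return frozenset().union(*icon_classes)


def generalize(*icon_classes):
    """Icon set of a common generalization: the intersection of the given sets."""
    if not icon_classes:
        raise ValueError("at least one icon class is required")
    return frozenset.intersection(*(frozenset(c) for c in icon_classes))


# Illustrative icon sets, not taken from the actual index.
ruler = {"crown", "sword", "scepter"}
equestrian = {"horse", "sword"}

equestrian_ruler = specialize(ruler, equestrian)
common = generalize(ruler, equestrian)
```

With these values, `equestrian_ruler` is a specialization of both input classes, while `common` keeps only the icon they share.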

Finally, the lower part of Fig. 10 shows a (UML-inspired) formulation of subject 
classes. We use double-headed arrows for specialization and instantiation. Dual to the 
(object) class “Object” we introduce a (subject) class “Subject” as the root of the 
subject class hierarchy. Although subject terms and icon classes are only loosely related,
it is very likely that all the entries of the same subject class will have a similar
profile over the icons of the corresponding icon class.

The “Subject Index” to which a subject belongs defines the domain to which a sub- 
ject term contributes. This models the perspectives under which the classification of a 
description can be viewed. For any given application often only one index will be 
considered at a time. However, personalization and customization will require the 
SOWing system to cope internally with several indices simultaneously. 

There are two fundamental uses of the structure shown in the diagram, Fig. 10: 

- if a subject worker manually establishes the classification relationship between an 
entry and a subject term (“entry” and “equestrian” in the example of Fig. 10), the 
entry contributes to the subject term’s definition. For the related icon classes this 
may mean that the icon class can be derived from or validated by the icons of the 
entries in an extent; 

- vice versa, icons assigned to a description can be matched against an icon class 
which in return corresponds to a subject term. Using some distance function the 
SOWing system can derive or propose the classification of an entry. 
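
Both uses can be sketched in Python. The text does not name the distance function, so the Jaccard distance used here is an assumption, as are the function names, the threshold `max_distance` and the example icon sets. Deriving an icon class as the union of the icons found in an extent is likewise only one plausible reading.

```python
def derive_icon_class(extent_icon_sets):
    """First use: derive an icon class from the icons of the entries in an extent."""
    return frozenset().union(*extent_icon_sets)


def jaccard_distance(a, b):
    """Distance between two icon sets: 0.0 for identical, 1.0 for disjoint sets."""
    a, b = frozenset(a), frozenset(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def propose_subject_terms(entry_icons, icon_classes, max_distance=0.5):
    """Second use: propose subject terms whose icon class is close to the entry."""
    scored = [(jaccard_distance(entry_icons, icons), term)
              for term, icons in icon_classes.items()]
    return [term for distance, term in sorted(scored) if distance <= max_distance]


# Illustrative icon classes keyed by subject term.
icon_classes = {
    "ruler": {"crown", "sword", "scepter"},
    "equestrian": {"horse", "rider"},
}
proposals = propose_subject_terms({"crown", "sword"}, icon_classes)
```

A proposal produced this way is a suggestion for the subject worker, who still decides whether the entry really contributes to the term's definition.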

In this way the subject classification differs substantially from object classification: in 
contrast to an object class a subject class does not define a uniform structure for all 
members of its extent. In addition, subject class entries may be members in more than 
one extent simultaneously and may change subject class membership during their 
lifetime. In contrast, an entry in its role as object belongs to exactly one object class 
and this membership is immutable over time. 
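
The contrast between the two kinds of classification can be made concrete with a short Python sketch. All class names, attributes and example titles are hypothetical; the sketch only assumes that an entry is created as an instance of exactly one media-specific object class, whereas subject class membership is recorded separately and may involve several extents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseEntry:
    """Features shared by all kinds of case entries (object view)."""
    title: str


@dataclass(frozen=True)
class ImageEntry(CaseEntry):
    painter: str = ""


@dataclass(frozen=True)
class TextEntry(CaseEntry):
    author: str = ""


class SubjectClass:
    """Subject view: an extent that imposes no uniform structure on its members."""

    def __init__(self, term):
        self.term = term
        self.extent = []  # may hold entries of different object classes

    def classify(self, entry):
        """Bind an entry to this subject term (at most once)."""
        if entry not in self.extent:
            self.extent.append(entry)


image_card = ImageEntry("an image card", painter="some painter")
text_card = TextEntry("a text card", author="some author")

ruler = SubjectClass("ruler")
equestrian = SubjectClass("equestrian")
ruler.classify(image_card)
ruler.classify(text_card)       # heterogeneous extent
equestrian.classify(image_card)  # one entry, two subject extents
```

The object class of `image_card` is fixed at creation, while its subject memberships were added one after the other.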



4.3 Personalization Facilities 

We substantiate some of the architectural decisions of the SOWing environment by
giving examples from the Warburg Electronic Library. First we look at the personalization
[24] of description work. Imagine that a user downloads a case entry as the
following XML document:

<picture-card>
  <title>Bonaparte Crossing ...</title>
  <artist>David, Jacques-Louis</artist>
</picture-card>

Once a user has copied the description into his personal working environment he is 
free to modify it at will and might end up with a document like the one shown below. 

<picture-card>
  <title>Napoleon Crossing ...</title>
  <artist>/artistdb/artist4368.xml</artist>
  <medium>Painting</medium>
</picture-card>

Clearly, three changes of a different nature were performed: 

- a value change: from “Bonaparte” to “Napoleon”; 

- a type change: “artist” now is a reference, no longer a value; 

- an additional feature is added: “medium”. 
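
How such changes might be told apart automatically can be sketched with Python's standard `xml.etree.ElementTree` module. This is not the tracking mechanism of the SOWing server; the function names are hypothetical, and the rule that a value ending in “.xml” counts as a reference is an assumption made only for this example.

```python
import xml.etree.ElementTree as ET

ORIGINAL = """<picture-card>
  <title>Bonaparte Crossing ...</title>
  <artist>David, Jacques-Louis</artist>
</picture-card>"""

PERSONALIZED = """<picture-card>
  <title>Napoleon Crossing ...</title>
  <artist>/artistdb/artist4368.xml</artist>
  <medium>Painting</medium>
</picture-card>"""


def looks_like_reference(text):
    """Assumption: a value ending in '.xml' refers to another document."""
    return text.strip().endswith(".xml")


def classify_changes(original_xml, personalized_xml):
    """Map each changed or new child element to the kind of change."""
    old = {child.tag: (child.text or "") for child in ET.fromstring(original_xml)}
    new = {child.tag: (child.text or "") for child in ET.fromstring(personalized_xml)}
    changes = {}
    for tag, value in new.items():
        if tag not in old:
            changes[tag] = "added feature"
        elif looks_like_reference(value) != looks_like_reference(old[tag]):
            changes[tag] = "type change"
        elif value != old[tag]:
            changes[tag] = "value change"
    return changes


changes = classify_changes(ORIGINAL, PERSONALIZED)
```

For the two documents above, the sketch reports a value change for `title`, a type change for `artist` and an added feature for `medium`.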







The latter two changes exemplify the semi-structured nature of the descriptions: part
of a personalized description conforms to the schema the community chose for
descriptions. Another part was added by the user, freely choosing a tag name, the
element’s content format, and the position at which to integrate the new element. This is
the kind of liberty the subject workers expect to have in the SOWing environment.

When the community decides to accept and re-integrate the user’s contribution into
the community’s subject index it has to perform the data (and possibly also the schema)
update [4]. The SOWing server discovers such type changes by tracking the description
work [11].



4.4 Case Entry Generation from XML Documents 

A SOWing user may submit a work case document in the form of the following XML 
document: 

<report>
  In the <medium>Painting</medium>
  <title>Napoleon Crossing ...</title> the artist
  <artist>David, Jacques-Louis</artist> depicts
  Napoleon riding ...
</report>

Then the SOWing system will start analyzing the above work document and generate 
the following initial work case entry: 

<picture-card>
  <!-- according to the original conceptual model -->
  <title>Napoleon Crossing ...</title>
  <artist>David, Jacques-Louis</artist>
  <!-- additional markup recognized -->
  <medium>Painting</medium>
</picture-card>

This automatically generated version of a work case entry serves as the basis for the 
description worker’s SOWing task. 
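
The generation step can be approximated by a short Python sketch based on the standard `xml.etree.ElementTree` module. The analysis actually performed by the SOWing system is not specified in the text; the function `generate_entry` and the constant `SCHEMA_TAGS`, which stands for the elements of the original conceptual model, are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

REPORT = """<report>
In the <medium>Painting</medium>
<title>Napoleon Crossing ...</title> the artist
<artist>David, Jacques-Louis</artist> depicts
Napoleon riding ...
</report>"""

# Assumed elements of the community's conceptual model for picture cards.
SCHEMA_TAGS = ("title", "artist")


def generate_entry(report_xml):
    """Build an initial picture card from the markup found in a report."""
    report = ET.fromstring(report_xml)
    # Collect marked-up fragments; a repeated tag keeps its last value only.
    found = {element.tag: (element.text or "").strip() for element in report}
    card = ET.Element("picture-card")
    # Elements known to the conceptual model come first, in schema order.
    for tag in SCHEMA_TAGS:
        if tag in found:
            ET.SubElement(card, tag).text = found[tag]
    # Additional markup recognized in the report is appended afterwards.
    for tag, text in found.items():
        if tag not in SCHEMA_TAGS:
            ET.SubElement(card, tag).text = text
    return card


card = generate_entry(REPORT)
card_xml = ET.tostring(card, encoding="unicode")
```

The narrative text between the elements is ignored in this sketch; only the tagged fragments are carried over into the generated entry.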



4.5 SOWing Interfaces and External Tool Support 

SOWing entities - work cases, documents, case entries, and subjects - may vary 
widely by their degree of formalization ranging from rather informal entities (to be 
mediated to humans) to fully formalized structures (to be read and processed by ma- 
chines). Furthermore, SOWing entities may be consumed and produced by external 
tools thus requiring a general interfacing technology to the world outside the SOWing 
platform. A more detailed discussion in the context of XML can be found in Sect. 5. 

Our vision of a SOWing system is to support a community of users in a long-term 
process of sharing, evolving and partially formalizing their understanding of a common
application domain - and not to force them to use a SOWing system. If such a
system is to be used it must be possible to create a wide variety of external representations
of the entities and relationships maintained by the system.

This is required for two reasons: 

- Users want to import the content from such a system into their own working envi- 
ronment (word processors, multimedia authoring tools, expert systems etc.). For 
this purpose a portable external representation is needed which can be understood 
by different software systems - at least in part. 

- The second reason for a portable and even human readable format is the necessity 
to facilitate the exchange of the data and the knowledge behind it between different 
people. Especially those who do not possess an identical or compatible SOWing 
system will need such an external representation. We consider the Extensible
Markup Language (XML) useful in both scenarios.

On the other hand the SOWing system itself must also be able to process data from 
various sources. Therefore, either converters between external data and the internal 
SOWing data model have to be provided or well-defined interfaces to external repre- 
sentations must be available. 
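
The converter option mentioned above can be illustrated by a minimal Python sketch. It assumes a flat internal record and reuses the picture-card markup from the example in Sect. 4; the function names `to_xml` and `from_xml` are hypothetical and are not part of the SOWing platform.

```python
import xml.etree.ElementTree as ET


def to_xml(tag: str, entry: dict) -> str:
    """Serialize a flat internal record as one XML element with children."""
    root = ET.Element(tag)
    for key, value in entry.items():
        ET.SubElement(root, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


def from_xml(text: str) -> tuple:
    """Inverse converter: read the element back into (tag, record)."""
    root = ET.fromstring(text)
    return root.tag, {child.tag: child.text for child in root}


# Hypothetical internal record for a picture card.
card = {"medium": "Painting"}
exported = to_xml("picture-card", card)
```

Such a pair of functions only covers flat records; nested entities and relationships would need a richer mapping.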



5 XML for Subject-Oriented Work 

In the previous sections we developed a scenario for a software environment which 
supports and records subject-oriented work. As soon as the content maintained by 
such a system is communicated to others or used cooperatively with different systems, 
the question arises, how to represent content and knowledge about it in a system- 
independent manner. The Extensible Markup Language and its standards [2], [15] are 
intended to enhance content structure and coherence and, by doing so, to improve 
content production and interchange. 

Parallel to our work on the SOWing environment much work has been invested in 
XML-based standards for various content-related services [1]. However, most of these
standardization efforts lack a common application scenario which could prove more 
than the usefulness of isolated standards, e.g., by making two or more of them collabo- 
rate. We regard our SOWing system as a cooperative platform for suites of XML- 
based services. 



5.1 XML and Documents 

One application of XML in the scenarios mentioned above is its use as a
platform-independent notation for content exchange. XML can be read and processed by
persons as well as by machines. Processing in this context means that machines are able
to store XML documents and increase their coherence by performing certain consistency
checks, e.g., whether XML data are well-formed or valid according to a certain
document type definition. With pure XML, as stated in [10], only syntactical consistency
and interoperability can be achieved. If a target machine is supposed to go beyond
that point and reason about the content of an XML document, domain knowledge has
to be hardwired into the processing machine and this knowledge has to be synchronized
with assumptions from the content author.

Under this assumption pure XML then could be used to facilitate communication 
between partners who share a common understanding of the nature of the content 
being exchanged. 
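
The purely syntactic check described in this section can be sketched with the Python standard library. The function name `is_well_formed` and the two sample fragments are illustrative assumptions; validation against a document type definition is not shown, since `xml.etree.ElementTree` does not validate.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical fragments modeled on the picture-card example.
good = "<picture-card><medium>Painting</medium></picture-card>"
bad = "<picture-card><medium>Painting</picture-card>"  # <medium> never closed
```

A check of this kind says nothing about what a medium is; that knowledge still has to be shared between the communicating partners.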



5.2 XML and Semantics 

Up to a certain degree XML can be used to represent subject terms and their
relationships. However, as [12] argues, important ontological relationships (e.g.,
“subclass-of” or “instance-of”) cannot be modeled directly by an XML document type
definition. Another problem arises because XML allows different ways to model the same
relationship and offers no support for any notion of equivalence. Properties of a
concept can be modeled in at least two ways: as an XML element on its own or as an
attribute of another element. Supposing a document type description is used for some
ontological statement, the receiver of a document based on this statement has to “know”
that this document type represents an ontology and how it can be derived from the
document type definition. Currently several proposals for a common formalism for
ontology representation are being discussed [10], [12].
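
The two ways of modeling a property can be made concrete with a short Python sketch. The markup and the helper `properties` are assumptions for illustration; the point is that the equivalence of both encodings is established by the receiving program, not by XML itself.

```python
import xml.etree.ElementTree as ET


def properties(xml_text: str) -> dict:
    """Collect properties of the root element, whether they are written
    as attributes or as child elements with text content."""
    root = ET.fromstring(xml_text)
    props = dict(root.attrib)
    for child in root:
        props[child.tag] = (child.text or "").strip()
    return props


# The same (hypothetical) property, once as an element, once as an attribute.
as_element = "<picture-card><medium>Painting</medium></picture-card>"
as_attribute = '<picture-card medium="Painting"/>'

# Both encodings yield the same property set only because this receiver
# was told how to interpret each of them.
same = properties(as_element) == properties(as_attribute)
```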

The Resource Description Framework (RDF, see, for example, [15]) provides 
primitives which facilitate the representation of ontologies in a much more natural way 
compared to (pure) XML. RDF can be implemented on top of XML and is suggested 
to be used as a basic framework for the definition of common ontology interchange 
languages, for example, OIL [7].

We expect that by working within the SOWing environment in combination with
the above standards we can substantially improve the re-use of publication work and
thus simplify the overall publication process. Furthermore, the understanding of
documents for readers outside the community for which a work was originally published
can be improved, and machine reasoning about the works becomes feasible to
the extent that both the SOWing system and the processing machine share
subject index information.



5.3 XML and Activities 

As mentioned earlier our notion of “work” refers not only to the result of a production
process, i.e., to documents on certain concepts or ideas, but to the entire process by
which the document is created. Through its reflective capabilities to observe the
production process the SOWing system can contribute to questions such as

- in which sequence were the production steps carried out?

- how long did each production step take?

- what are the influences leading to a concrete object production sequence?

For works belonging to the past this information usually is not available or has to be
derived from other sources. For description work as carried out by the Art Historians
in the WEL project, however, information about the production process of such a
description work can be collected almost automatically by the SOWing environment.
Work cases being “first class citizens” can themselves become subjects of later
research and one does not need to “guess” what the circumstances of a production may
have been but one knows what these circumstances were - at least to the extent the
SOWing environment's model documents them.

In this sense the SOWing environment can document the process which led to a 
certain document and enhance the understanding of the “work” behind the document. 
On the other hand this process description can be used to derive recommendations for 
future production processes in the sense of instructions that have to be carried out for 
tasks similar to a successfully completed process [14]. Well-documented production
processes can serve as templates for future production and contribute to process
standardization [25]. XML-based languages such as XRL (eXchangeable Routing
Language [31]) can be used to represent such processes.

In summary, we expect a twofold contribution when using XML as a common 
SOWing interface: XML-based tools will substantially and rapidly enhance the SOW- 
ing functionality and, in reverse, the SOWing model will give a semantic underpinning 
and connectivity to XML services and thus significantly improve their usability. 



6 Summary and Outlook 

Our SOWing platform, experiments and the SOWing project as a whole aim at relating,
organizing and defining subjects and documents as well as the work behind them. In
an interdisciplinary project with our Art History colleagues who are working in the
subject area of “Political Iconography” we gained substantial insight into their
Subject-Oriented Working (SOWing) needs and into initial requirements for a generic
SOWing platform. In this paper we outlined the project, its basic models, their
generalization as well as our initial experiences with prototypical SOWing implementations
and compared our work with various XML-related activities.

On the modeling level we improved our understanding of 

- the basic SOWing entities and their relationships; 

- the notion of work, i.e., the production context of content; 

- the role of key icons and key words for content-based subject definition. 

On the system level the SOWing project is currently investigating the requirements for

- a generator-based architecture for SOWing entities and relationships; 

- reflective system technology and its use for advanced SOWing services; 

- customized and personalized SOWing indices; 

- XML-based tool interoperability. 

Future large scale content-oriented project work will have to interact with a substantial
number of SOWing indices and, therefore, requires a technology for “pluggable SOWing
arrays”. SOWing indices have to deliver their content through a wide variety of
document types ranging from media documents laid out for human interaction to
structured (and typed) documents for machine consumption. Finally, SOWing support for
the modeling, management and enactment of content-oriented work has to be further
improved.







Acknowledgement. We would like to thank the Ministry for Science and Research, 
County of Hamburg, and the Warburg Stiftung, Hamburg, for their continuous support 
of our research. In addition we gratefully acknowledge a research grant from the 
Deutsche Forschungsgemeinschaft, DFG, in support of our R&D project on 
“Cooperative Reference Libraries”. 



References 

1. Berners-Lee, Tim, Fischetti, Mark: Weaving the Web; the original design and ultimate 
destiny of the World Wide Web by its inventor. Harper, San Francisco (2000) 

2. Bray, Tim, Paoli, Jean, Sperberg-McQueen, C. M., Maler, Eve: Extensible Markup Lan- 
guage (XML) 1.0 (2nd Edition). http://www.w3.org/TR/2000/REC-xml-20001006 (2000) 

3. Bruhn, M.: The Warburg Electronic Library in Hamburg: A Digital Index of Political Ico- 
nography. In: Visual Resources, Vol XV (1999) 405-423 

4. Buckingham Shum, Simon: Negotiating the Construction and Reconstruction of Organiza- 
tional Memories. Journal of Universal Computer Science, vol. 3, no. 8 (1997) 899-928 

5. CoreMedia AG: Homepage, http://www.coremedia.com (2001) 

6. Coulter, Neal, French, James, Glinert, Ephraim, Horton, Thomas, Mead, Nancy, Rada, Roy, 
Ralston, Craig, Rodkin, Anthony, Rous, Bernard, Tucker, Allen, Wegner, Peter, Weiss, Eric, 
Wierzbicki, Carol: ACM Computing Classification System 1998: Current Status and Future 
Maintenance. Technical report, http://www.acm.org/class/1998/ccsup.pdf (1998) 

7. Cover, R.: Ontology Interchange Language, http://xml.coverpages.org/oil.html (2001) 

8. De Michelis, Giorgio, Dubois, Eric, Jarke, Matthias, Matthes, Florian, Mylopoulos, John, 
Schmidt, Joachim W., Woo, Carson, Yu, Eric: A Three-Faceted View of Information Sys- 
tems. In: Communications of the ACM, 41(12) (1998) 64-70 

9. Deacon, Terrence W.: The Symbolic Species. The co-evolution of language and the brain. 
W. W. Norton & Company, New York/London (1997) 

10. Decker, S., van Harmelen, F., Broekstra, J., Erdmann, M., Fensel, D., Horrocks, I., Klein, 
M., Melnik, S.: The semantic web: The roles of xml and rdf. In: IEEE Expert, 15(3) (2000) 

11. Dourish, P., Bellotti, V.: Awareness and Coordination in Shared Workspaces. In: Proceed- 
ings of ACM CSCW'92 Conference on Computer-supported Cooperative Work, ACM- 
Press (1992) 107-114 

12. Fensel, D.: Relating Ontology Languages and Web Standards. In: Informatik und 
Wirtschaftsinformatik. Modellierung 2000, Foelbach Verlag (2000) 

13. Gruber, T. R.: A translation approach to portable ontology specifications. Technical Report 
KSL 92-71, Computer Science Department, Stanford University, California (1993) 

14. Khoshafian, Setrag, Buckiewicz, Marek: Introduction to Groupware, Workflow, and Work- 
group Computing. John Wiley & Sons, Inc., New York (1995) 

15. Lassila, O., Swick, R. R.: Resource Description Framework (RDF) Model and Syntax 
Specification. Recommendation. W3C. http://www.w3.org/TR/REC-rdf-syntax/ (2000) 

16. Matthes, F., Schröder, G., Schmidt, J.W.: Tycoon: A Scalable and Interoperable Persistent
System Environment. In: Atkinson, Malcom P., Welland, Ray (eds.): Fully Integrated Data
Environments, ESPRIT Basic Research Series, Springer-Verlag (2000) 365-381

17. Maurer, Hermann, Lennon, Jennifer: Digital Libraries as Learning and Teaching Support. 
In: Journal of Universal Computer Science, vol. 1, no. 11 (1995) 719-727 

18. McLuhan, Marshall: Understanding Media. The Extensions of Man. The MIT Press, Cam- 
bridge/Massachusetts/London (1964, 1994) 







19. Niederée, C., Hattendorf, C., Müßig, S. (with J.W. Schmidt and M. Warnke): Warburg
Electronic Library - Eine digitale Bibliothek für die Politische Ikonographie. In: uni-hh
Forschung, Beiträge aus der Universität Hamburg, XXXI (1997) 6-16

20. Nürnberg, Peter J., Schneider, Erich R., Leggett, John J.: Designing Digital Libraries for
the Hyperliterate Age. In: Journal of Universal Computer Science, 2 (9) (1996) 610-622

21. Raulf, M., Müller, R., Matthes, F., Scheunert, K.J., Schmidt, J.W.: Subject-oriented
Document Administration for Internet-based Project Management (in German). In: Proceedings
“Management and Controlling of IT-Projects”, dpunkt.verlag, Heidelberg (2001)

22. Rostek, Lothar, Mohr, Wiebke, Fischer, Dietrich: Weaving a Web: The Structure and Crea- 
tion of an Object Network Representing an Electronic Reference Work. In: Fankhauser, P., 
Ockenfeld, M. (eds.): Integrated Publication and Information Systems. 10 Years of Re- 
search and Development at GMD-IPSI, Sankt Augustin: GMD (1993) 189-199 

23. Schmidt, J. W., Matthes, F.: The DBPL Project: Advances in Modular Database
Programming. Information Systems, 19(2) (1994) 121-140

24. Schmidt, J.W., Schröder, G., Niederée, C., Matthes, F.: Linguistic and Architectural
Requirements for Personalized Digital Libraries. In: International Journal on Digital
Libraries, 1(1) (1997)

25. Schmidt, Joachim W., Sehring, Hans-Werner: Dockets: A Model for Adding Value to
Content. In: Akoka, Jacky, Bouzeghoub, Mokrane, Comyn-Wattiau, Isabelle, Métais,
Elisabeth (eds.): Proceedings of the 18th International Conference on Conceptual Modeling,
volume 1728 of Lecture Notes in Computer Science, Springer-Verlag (1999) 248-262

26. Schmidt, Joachim W., Sehring, Hans-Werner, Warnke, Martin: The Index for Political
Iconography and the Warburg Electronic Library (in German). In: Proceedings of the
International Symposium on “Archiving Processes”, Köln (2001)

27. Simone, Carla, Divitini, Monica: Ariadne: Supporting Coordination through a Flexible Use 
of the Knowledge on Work Processes. In: Journal of Universal Computer Science, vol. 3, 
no. 8 (1997) 865-898 

28. Smith, John Miles, Smith, Diane C. P.: Database Abstractions: Aggregation and Generali- 
zation. In: TODS 2(2) (1977) 105-133 

29. Sowa, John F.: Knowledge Representation, Logical, Philosophical, and Computational 
Foundations. Brooks/Cole, Thomson Learning (2000) 

30. Svenonius, Elaine: The Intellectual Foundation of Information Organization. The MIT 
Press, Cambridge/Massachusetts/London, England (2000) 

31. van der Aalst, W. M. P., Kumar, A.: XML Based Schema Definition for Support of
Inter-organizational Workflow.
http://tmitwww.tm.tue.nl/staff/wvdaalst/Workflow/xrl/isr01-5.pdf

32. van Waal, Henri: ICONCLASS - An iconographic classification system. North-Holland 
Publishing Company, Amsterdam/Oxford/New York, completed and edited by L.D. Cou- 
prie, R.H. Fuchs, E. Tholen, G. Vellekoop, a.o. (1973-85) 

33. Warnke, Martin: Bildindex zur politischen Ikonographie. Forschungsstelle Politische
Ikonographie, Kunstgeschichtliches Seminar der Universität Hamburg (1996)

34. Warnke, Martin (ed.): Der Bilderatlas Mnemosyne. Unter Mitarbeit von Claudia Brink. 
Akademie Verlag, Berlin (2000) 

35. Webster’s Third New International Dictionary of the English Language, Chicago (1996) 




Query Optimization through Removing Dead Subqueries



Jacek Plodzien 1 and Kazimierz Subieta 2,1 

1 Institute of Computer Science PAS, Warsaw, Poland 
{jpl, subieta}@ipipan.waw.pl

2 Polish-Japanese Institute of Information Technology, Warsaw, Poland



Abstract. A dead subquery is a part of a query not contributing to the 
final query result. Dead subqueries appear mostly due to querying views. 
A method of detecting and eliminating dead subqueries is presented. It 
assumes that views are processed by query modification, which macro- 
substitutes a view invocation with the corresponding view definition. 
The method is founded on a new semantic framework of object-oriented 
query languages, referred to as the stack-based approach. Dead parts are 
detected through static (compile-time) analysis of scoping and binding 
properties for names occurring in a query. The method is explained by 
a pseudo-code algorithm and illustrated by examples. 



1 Introduction 

A disadvantage of using virtual (non-materialized) database views is that a view
often calculates more than is necessary in a particular view invocation. For
example, a view calculates, among others, average salaries in departments, but
in a particular query invoking the view these values are not used. Hence the
straightforward approach to view processing may result in a significant waste of
processing time. The problem has been partly solved for relational query languages
through the technique known as query modification [8]: a view definition
is treated as a macro-definition, where each view invocation is textually substituted
with the query from the body of the view definition. Then, the parts of
the view definition not used in a particular query are removed from the resulting
query. Such parts are called dead subqueries (parts), because they do not contribute
to the final result. (Dead subqueries may also appear in queries without
views; however, this is rather rare.)

Query modification has been adopted and generalized for object-oriented 
query languages such as OQL [12]. This technique prepares a foundation for 
our method of detecting and removing dead subqueries. Such optimization is 
performed through rewriting a query into a semantically equivalent query having 
no dead subqueries. After removing dead subqueries, the resulting query can be 
further processed by other optimization methods, e.g. those based on factoring out
independent subqueries, changing the order of operators, or using indices [4,5,6].

In a general case of object-oriented query languages and views, the problem
of removing dead subqueries is not easy. A query resulting from query modification
can be complex, nested, can involve inheritance and encapsulation, and can
invoke methods or virtual attributes. It can also contain definitions of auxiliary
names (i.e., correlation variables, variables bound by quantifiers, etc.) and can be
built upon various operators such as selection, dot (i.e., projection/navigation),
dependent join, quantifiers, arithmetic functions, aggregate functions, and others.
Moreover, views can be nested; i.e. a particular view can invoke other views.

A. Caplinskas and J. Eder (Eds.): ADBIS 2001, LNCS 2151, pp. 27-40, 2001.
(c) Springer-Verlag Berlin Heidelberg 2001

To the best of our knowledge, no other paper considers the problem of removing
dead subqueries for object-oriented query languages and views. This is probably
because object views have not been developed to a satisfactory degree, despite
many papers (for a discussion see [9]).

Nonetheless, a related idea is discussed for the relational approach. For in- 
stance, in [3] the authors consider a case when redundant joins of base relations 
appear as a result of using views defined (directly or indirectly) against those 
relations. A similarity between the method discussed in [3] and our solution is 
that the general idea of both techniques is to avoid performing unnecessary op- 
erations. The main difference is that [3] deals with joins, which are the most 
expensive operations in relational databases, and is restricted to CPS-queries, 
while our method concerns any (arbitrarily complex) subqueries whose results 
are not used. Joins are less typical in the object approach, because foreign keys
are usually replaced with links. Our approach has the potential to cover cases when
dead subqueries are not connected by joins.

Our (fairly general) approach assumes that object views are first-class stored 
functional procedures [9]. To deal with scoping and binding rules for names 
occurring in queries/programs, we follow the Stack-Based Approach (SBA) to 
object-oriented query languages [5,10,11]. The approach assumes that some query 
operators (where, dot, dependent/navigational join, quantifiers, etc) act on an 
environment stack similarly to program blocks or procedures, that is, they open 
a new section for binding names. 
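
The idea that an operator such as where opens a new section for binding names can be illustrated by a simplified Python sketch. This is not the SBA machinery itself: the class `EnvStack`, the function `where`, and the sample Student data are assumptions, and binders are reduced to plain dictionary entries.

```python
class EnvStack:
    """Simplified environment stack: a list of sections, top is last."""

    def __init__(self, base):
        self.sections = [base]

    def bind(self, name):
        # Search sections from the top of the stack downwards.
        for section in reversed(self.sections):
            if name in section:
                return section[name]
        raise NameError(name)


def where(es, objects, predicate):
    """Selection: for each object, open a section holding binders to its
    attributes, evaluate the predicate, then pop the section."""
    result = []
    for obj in objects:
        es.sections.append(obj)      # open a new section
        try:
            if predicate(es):
                result.append(obj)
        finally:
            es.sections.pop()        # close it again
    return result


# Hypothetical stored objects; only the attribute names follow the paper.
students = [
    {"name": "Ann", "year": 1, "age": 34},
    {"name": "Bob", "year": 2, "age": 41},
]
es = EnvStack({"Student": students})
freshmen = where(es, es.bind("Student"), lambda e: e.bind("year") == 1)
```

Inside the predicate the name year is found in the section opened by where, while Student is found in the base section, which mirrors the block-like scoping described above.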

In our research we focus on static (i.e., compile-time) query optimization. In 
order to gather all the information that is needed to determine whether a query 
can be optimized, a special phase of query processing - so-called static analysis - 
is performed [7]. During static analysis we simulate the behavior of the run-time 
query result stack (QRES) and the environment stack (ES) through the corre- 
sponding static stacks (i.e., compile-time versions of ES and QRES). They keep 
signatures of values returned by queries and signatures of run-time environment 
stack sections and subsections. Recursive analysis of these signatures and their
relationships with the syntax tree of the processed query makes it possible to
detect unused environment stack subsections, then to navigate from these subsections
to corresponding parts of the query syntax tree. These parts are cut off;
i.e. dead subqueries are removed.

The general query processing architecture is presented in Fig. 1. First, the 
text of a query is parsed and its syntax tree is constructed. Then, the query 
is optimized by rewriting. Static analysis (performed by a query optimizer) in- 
volves a metabase (a special data structure obtained from a database schema), a 
structure simulating the environment stack (static ES denoted by S_ES), and a 
structure simulating the query result stack (static QRES denoted by S_QRES). 







Fig. 1. General architecture of query processing (figure labels: syntax tree, optimised query)



After optimization the query is evaluated; the evaluation involves run-time structures,
that is, an object store, ES and QRES.

An important aspect of query optimization is a cost model. However, because 
removing dead subqueries always improves performance, it should be applied 
whenever possible. As a consequence, there is no need to assess performance 
improvement during optimization, even for cost-based optimization. Therefore, 
we do not consider a cost model in this paper. 

The queries in the paper are defined for an example database whose schema 
(the class diagram in slightly modified UML) is shown in Fig. 2. The classes
Lecture, Student, Professor and Faculty model lectures attended by students 
and given by professors working in faculties, respectively. Professor objects can 
contain multiple complex prev-job subobjects (previous jobs). The name of a 
class, attribute, etc, is followed by its cardinality, unless it is 1. All the properties 
are public. 




Fig. 2. The class diagram of the example database 

















The rest of the paper is organized as follows. Section 2 presents a few SBA
concepts important for the paper. Section 3 describes general properties of dead 
subqueries. In Sect. 4 we discuss in detail our method of finding and removing 
a dead subquery through rewriting. We also present an example of applying the 
method to a query. In Sect. 5 we discuss how to take into account all dead sub- 
queries in a query. In Sect. 6 we extend our discussion by considering other cases 
when dead parts can appear. Section 7 concludes our discussion. 



2 Essential Concepts 



SBA is an attempt to build a uniform semantic foundation for integrated query 
and programming languages. The approach is abstract and universal, which 
makes it relevant to a general object model being an abstraction over e.g. ODMG 
OQL [2] and XML models. SBA has its own query language - SBQL, which is 
a syntactic variant of OQL with a fully formalized semantics. In this section 
we explain a few SBA concepts essential for our method of removing dead subqueries.
We assume, however, that the reader is at least somewhat familiar with the
general idea of SBA.



2.1 Subsections 

Normally, an ES section induced by a particular non-algebraic query operator
consists of all binders that are returned by the nested function for the row of a
result table currently being processed. Thus we cannot distinguish between binders
created for the results of different subqueries. To make such a distinction possible
we introduce the concept of a subsection. It is a part of an ES section which
contains binders constructed for one element of the row currently being processed,
for which a non-algebraic operator has opened that section.

Subsections make it possible to assign each binder of a given section to the
appropriate subquery of a query being evaluated (represented as a syntax tree).
Each subsection contributes to partial results of further subqueries, which in
turn generate other subsections, etc., until the final result is achieved. By
assigning to each subsection the node of the corresponding syntax tree of the query,
we can propagate identifiers of tree nodes in order to obtain information about which
subsections contribute to the result of an outer query. This process is performed
recursively. Because subsections are associated 1:1 with subqueries (i.e., the
corresponding syntax subtrees), in this manner we detect dead subqueries (i.e.,
subqueries associated with unused subsections on ES).

This process is performed on S_ES and S_QRES. During static analysis we 
deal with section signatures and table signatures, which model run-time ES sec- 
tions and query results, respectively. To detect dead subqueries we use subsection 
signatures, which model run-time subsections. This part of static analysis is dis- 
cussed in detail in Sect. 4. 







2.2 Views and Query Modification 

Views¹ raise the level of abstraction and modularity while building an application,
but may result in poor performance due to the overhead connected with the evaluation
of a view during each invocation. To deal with this problem, in relational
databases queries invoking views are optimized through query modification. It
combines a query containing view invocations with the definitions of the views
being invoked. The resulting query has no references to views and can be optimized
at a textual level by rewriting rules. The advantage of this technique
is twofold. First, it makes it possible to avoid the performance overhead related to
processing views. Second, it enables the use of other optimization methods, e.g.
indices. For these qualities we have adapted the query modification technique to
object-oriented queries [12].

In SBA views are treated as first-class (i.e., stored in a database) functional 
procedures that can return a collection [11]. They can be defined by a single query 
or by a sequence of statements accomplishing a complex algorithm, possibly 
with a local environment. Moreover, they enable the programmer to compose a 
virtual object from several stored objects. Such views can be recursive, can have 
parameters, and can be updateable. The idea of views in SBA supports a fairly 
general object model including classes, methods, inheritance, etc. 

Assuming some discipline concerning the functionality, syntax and semantics
of view definitions, query modification reduces to regular macro-substitution.
Such an approach to views is assumed in OQL [2] through the
“define” clause. Query languages and views for XML databases are currently
the subject of extensive research, but there is little agreement concerning this
issue, see e.g. [1].

In this section we discuss view processing in the context of query optimization 
based on rewriting. In SBQL views are defined as follows (we omit typing): 
view Freshmen
begin
    return Student where year = 1
end;

Functions like the one above can be invoked in queries; an example (“get names
and ages of freshmen older than 30”) is presented below:

(Freshmen where age > 30).(name, age)

This query is evaluated as follows. First, the name Freshmen is bound in the
ES base section; then the function is invoked. It has no parameters and its local
environment is empty, so the ES section for this invocation will contain only
the return address. The body of Freshmen consists of a single query which is
evaluated in the standard way. The result of this query consists of references to
appropriate Student objects. Then, the function is terminated and the ES section
implied by that function is popped. Next, the result returned by Freshmen is
processed by the where operator: the result contains references to those Student
objects for which the value of the age attribute is greater than thirty. Then, this
result is processed by the dot. The final result consists of pairs of references to
the name and age subobjects for the selected Student objects.

¹ We deal only with virtual views. Materialized views present another subject, not
relevant to this paper.

The functions, introduced as views, possess the following properties: 

• They do not introduce their own local environments. 

• They have no parameters. 

• They are dynamically bound. 

• The only statement of their bodies is a single query. 

Functions having these properties are semantically equivalent to macro-definitions.
We have constructed the syntax and semantics of SBQL in such a way
that it is fully orthogonal with respect to the combination of any queries, including
queries containing auxiliary names (e.g., names defined by the as operator,
variables bound by quantifiers, new names used in view definitions, etc.). For
example, after query modification the above query will have the form:

((Student where year = 1) where age > 30).(name, age)

Thus (in contrast to [8]) our query modification method is extremely simple:
we just textually macro-substitute a view invocation with the body of this view.
As will be shown in further examples, new names introduced in the view definition
for attributes of virtual objects present no irregularity or difficulty for us.
For a more detailed discussion of the issue see [12].
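
The textual macro-substitution described above can be sketched in Python as follows. The dictionary `VIEWS` and the function `modify_query` are hypothetical names, and the sketch assumes that view names never coincide with attribute or class names, since a regular expression cannot tell them apart.

```python
import re

# View bodies keyed by view name; the body is taken from the Freshmen example.
VIEWS = {"Freshmen": "Student where year = 1"}


def modify_query(query: str, views: dict) -> str:
    """Replace each view invocation by its parenthesized body.
    Repeats until no view name is left, so nested views are expanded
    (recursive views would loop forever and are not handled)."""
    if not views:
        return query
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, views)) + r")\b")
    while pattern.search(query):
        query = pattern.sub(lambda m: "(" + views[m.group(1)] + ")", query)
    return query


rewritten = modify_query("(Freshmen where age > 30).(name, age)", VIEWS)
```

Wrapping the substituted body in parentheses keeps the operator grouping of the original query intact.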

3 Properties of Dead Subqueries 

While searching for dead subqueries we have to exclude subqueries that create
directly or indirectly the result of a query. For instance, in the query

Professor ⋈ (works_in.Faculty)

the subqueries Professor and works_in.Faculty are not dead because their results
are directly parts of the result of the whole query. Similarly, in the query

Professor.name

the subquery Professor is not dead because it indirectly contributes to the final
result. Subqueries that do not satisfy either of those two conditions are dead and
can be removed. For example, in the query

(Professor ⋈ (works_in.Faculty)).age

the subquery works_in.Faculty does not contribute to the result of the query
either directly or indirectly and thus is dead.

In some (rare) cases subqueries that are apparently dead cannot be removed.
Owing to space limits we do not discuss this degenerate case in the paper.

Before discussing our method for detecting dead subqueries we present a few 
examples and discuss the general idea. 

3.1 Examples of Dead Subqueries 

The most frequent case when a query has dead subqueries is when it contains
a view invocation and performs navigation starting from the result returned
by the view. Let us show this by an example. The view below gets for each
professor his/her name and salary, and calculates a bonus he/she should receive,
giving them auxiliary names n, s, and b, respectively. The view can be considered
the definition of virtual objects named ProfNameSalaryBonus having attributes
n, s, and b.

view ProfNameSalaryBonus
begin
    return Professor.
        (name as n, salary as s, ((0.5 * salary + 10 * age) * count(gives)) as b)
end;

To retrieve through this view the names of all professors stored in a database,
we pose the following query:

ProfNameSalaryBonus.n

After macro-substitution it has the form: 

{Professor. 

{name as n, salary as s, ((0.5 *salary + 10* age)*count{gives)) as b)).n 
Note that the final projection (i.e. ”.n”), which creates the final result of the 
whole query, refers to the result of the subquery name as n but does not refer to 
the results of the subqueries salary as s and ((0.5 * salary + 10*age)*count(<7zwes)) 
as b. Hence the latter two subqueries are dead and can be removed. After the 
transformation the query has the form {Professor.{name as n)).n which can be 
further rewritten to Professor. name by another optimization method (removing 
unnecessary auxiliary names [4]). 
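The effect of this rewriting can be imitated by a few lines of Python. The
sketch below is only an illustration under simplifying assumptions: the view
body is modelled as a flat list of named subqueries, and the class Named and
the functions prune_struct and render are hypothetical names introduced here,
not parts of SBA or of the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class Named:
    """A subquery with an auxiliary name, i.e. `expr as name` (hypothetical model)."""
    expr: str
    name: str


def prune_struct(fields, used_names):
    """Keep only the named subqueries whose auxiliary names are referred to."""
    return [f for f in fields if f.name in used_names]


def render(source, fields, projection):
    """Print the query in the form (source.(f1, f2, ...)).projection."""
    inner = ", ".join(f"{f.expr} as {f.name}" for f in fields)
    return f"({source}.({inner})).{projection}"


# Body of the view ProfNameSalaryBonus from the text.
view_body = [
    Named("name", "n"),
    Named("salary", "s"),
    Named("((0.5*salary + 10*age)*count(gives))", "b"),
]

# The final projection ".n" refers only to the auxiliary name n.
before = render("Professor", view_body, "n")
after = render("Professor", prune_struct(view_body, {"n"}), "n")
print(before)
print(after)
```

The second printed line is the form (Professor.(name as n)).n discussed above;
the further rewriting to Professor.name is a separate optimization and is not
modelled here.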

Another non-algebraic operator that may cause a subquery to be dead is a
quantifier. The analogy with the case of the dot operator is especially clear
if a query is written in a form reflecting the schema of its evaluation, that is,
if quantifiers are used as infix operators [4].



3.2 Summary 

As we have discussed, a query may have dead subqueries if it contains dot 
operators or quantifiers. The reason is that a dot and quantifiers have a special 
property: it is possible for them to consume only a part of the result of their left 
subquery, which makes the rest of that left operand not contribute to the result 
of the entire query. 

Other non-algebraic operators considered in SBA, like a selection or a
dependent join, do not have this property. For example, ⋈ consumes the whole
result of its left subquery and thus does not cause any subquery to be dead. The
where operator can use only a part of the result of its left operand, but the
whole of that left subquery determines the result of a query. This means that
the selection operator does not cause any subquery to be dead either (with some
exceptions; see Sect. 6). We illustrate this by the following query ("get
professors older than fifty together with the lectures they give"):
(Professor ⋈ (gives.Lecture)) where age > 50

Obviously, it is impossible to find a query after where that would cause
some subquery before where to be dead. A similar property holds for ⋈.



4 Detecting Dead Subqueries 

In order to optimize a query with regard to dead parts, a modified static analysis
is required. In comparison to [7], slightly modified versions of the static_eval
procedure and the static_nested function are used: ds_static_eval and
ds_static_nested, respectively. Moreover, the concept of signatures must be
modified to keep the following additional information:

• In a modified table signature (on S_QRES), each element is augmented with 
a pointer to the root of the appropriate syntax query subtree (representing a sub- 
query which returns the result modeled by this element). 

• Similarly, in a modified section signature (on S_ES), each subsection signa- 
ture is augmented with a pointer to the root of the appropriate syntax query 
subtree. Bottom sections’ signatures (for which subdivision into subsections is 
irrelevant) are augmented with NULL pointers. 

Additionally, while binding a name in some subsection signature, the
subquery in the syntax tree, for the result of which that signature was built, is
flagged as "REFERRED_TO". This flag means that this particular query syntax
subtree has been used to evaluate its outer query (hence the corresponding
subquery is not dead). A part of the ds_static_eval procedure is presented below;
it covers only the case of non-algebraic operators after both q1 and q2 were
evaluated but before the final result is created. For details and other cases, e.g.,
algebraic operators, see [4].

Before applying ds_static_eval to a given query we can scan its definition
to check whether it can potentially have dead parts, that is, whether it contains
dots and/or quantifiers. If it does not, then we know that it has no dead parts
and we do not have to use the ds_static_eval procedure to optimize it.

The auxiliary function Subquery(ss) returns the pointer to the subquery in 
the syntax tree for the result of which (sub)section signature ss was built; for a 
bottom section signature it returns NULL. 

if (θ ∈ {dot, ∀, ∃}) then
  for each subsection signature ss in newscope do
    (*check if that (sub)result of q1 is used*)
    if (the subquery pointed to by Subquery(ss) is not
        REFERRED_TO) then (*it's dead*)
    begin
      if (the subquery pointed to by Subquery(ss) is
          the right operand of a dot operator) then
        (*remove that dead part and a "future" dead part*)
        remove that dot and both of its subqueries;
      else
        (*remove only the dead part*)
        remove the subquery pointed to by Subquery(ss);
      remove the operator connecting the eliminated
        part to the rest of the whole query;
      remove from q1result the element for which ss was
        constructed;
    end;

Note that in queries of the form q1 θ q2 (where θ is a dot or a quantifier)
only the subquery q2 can make use of the result of q1; hence, once the evaluation
of q1 θ q2 ends, the result of q1 cannot be used by any subquery anymore.
Therefore, right after that evaluation we check whether the result of q1 was
used: we do it by checking in the syntax tree which parts of the scope opened for
the result of q1 were referred to. Those that were not referred to will not be
referred to afterward, thus the corresponding subqueries are dead and can be
removed.

In the case when a dead part is the right operand of a dot operator, we remove 
not only that dead part, but also the left operand of that dot operator, since after 
removing its right operand its left operand would become dead. Additionally, we 
remove from the S_QRES table signature those elements that are the results of 
those dead parts, because we do not need to analyze them any longer. 

Note that since the algorithm is performed at compile time, its cost does not 
influence the cost of run-time query processing. 

Example 

We will explain our method by a detailed example. The view
view LectProfManyCredits
begin
  return (Lecture ⋈ given_by.Professor) where credits > 5
end;

returns lectures whose credits are greater than five together with the professors
giving them. To get only the subjects of those lectures via this view, we can ask
the query

LectProfManyCredits.subject

After macro-substitution it has the form

((Lecture ⋈ given_by.Professor) where credits > 5).subject    (i)

Its syntax tree is shown in Fig. 3 (we omit non-terminal grammar symbol nodes 
that appear in this tree). 

Query (i) contains dot operators, so it may have dead parts. The states of
the static stacks during a modified static analysis of the query are presented in
Figs. 4 to 7. For each name being bound we show a pair denoting
the stack size and the binding level, and for each non-algebraic operator being
evaluated we show the number of the section it opens. If the section signature in
which a name is bound consists of more than one subsection signature, we label
each subsection signature by augmenting the number of the section signature
with a, b, c, etc., where a denotes the leftmost one. Pointers to subqueries in the
syntax tree are designated by invocations of the Root(query) function, which
returns the pointer to the root of (sub)query query in its syntax tree.




[Figure: syntax tree of query (i); the operand Lecture is marked as subquery (a)
and given_by.Professor as subquery (b)]

Fig. 3. The syntax tree for query (i)



[Figure: states of S_ES and S_QRES in five panels: the initial state; after
static analysis of "((Lecture"; after "((Lecture ⋈"; after
"((Lecture ⋈ given_by"; after "((Lecture ⋈ given_by."]

Fig. 4. The first part of the analysis



[Figure: states of S_ES and S_QRES in four panels: after static analysis of
"((Lecture ⋈ given_by.Professor"; after "((Lecture ⋈ given_by.Professor)";
after "((..) where"; after "((..) where credits"]

Fig. 5. The second part of the analysis











[Figure: states of S_ES and S_QRES in three panels: after the static analysis of
"((..) where credits >"; after "((..) where credits > 5"; after
"((..) where credits > 5)"]

Fig. 6. The third part of the analysis



[Figure: states of S_ES and S_QRES in three panels: after the static analysis of
"((..) where credits > 5)."; after "((..) where credits > 5).subject"; the final
state of the static stacks. The subsection signatures for subqueries (a) and (b)
are labelled.]

Fig. 7. The fourth part of the analysis



In Figs. 5, 6, and 7 we present the section signature (at the top of S_ES)
containing subsection signatures for subqueries (a) and (b). The absence of any
binding, throughout the entire analysis, in the right subsection corresponding to
subquery (b) means that the subquery given_by.Professor is dead.

We optimize (i) by pruning its syntax tree: we cut off the subtree modeling
the dead part (along with ⋈). The new, optimized form of the syntax tree is
shown in Fig. 8. It models the query (Lecture where credits > 5).subject.
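The outcome of this example can be reproduced with a strongly simplified
Python sketch. It is not the ds_static_eval procedure: the static stacks S_ES and
S_QRES are not modelled, a section is reduced to a list of subsections, and the
names Subquery, bind and remove_dead as well as the attribute sets are
assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Subquery:
    text: str                  # stands in for the pointer into the syntax tree
    names: frozenset           # names bindable in its subsection signature
    referred_to: bool = False  # the REFERRED_TO flag


def bind(name, section):
    """Bind `name` in the first subsection providing it and flag its subquery."""
    for sq in section:
        if name in sq.names:
            sq.referred_to = True
            return sq
    return None  # would be bound in a lower section; not modelled here


def remove_dead(section, bound_names):
    """Perform the bindings, then keep only subqueries flagged REFERRED_TO."""
    for name in bound_names:
        bind(name, section)
    return [sq for sq in section if sq.referred_to]


# Section opened for the result of (Lecture ⋈ given_by.Professor):
# subsection (a) for Lecture, subsection (b) for given_by.Professor.
section = [
    Subquery("Lecture",
             frozenset({"subject", "credits", "attended_by", "given_by"})),
    Subquery("given_by.Professor",
             frozenset({"name", "age", "salary", "works_in", "gives"})),
]

# Names bound while analysing "where credits > 5" and ".subject".
live = remove_dead(section, ["credits", "subject"])
print([sq.text for sq in live])
```

Only subsection (a) receives bindings, so only Lecture survives, which
corresponds to the pruned query (Lecture where credits > 5).subject.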














[Figure: the pruned syntax tree of query (i), with its root marked]

Fig. 8. The form of the syntax tree for query (i) without the dead subquery



5 How to Rid a Query of All of Its Dead Parts 

The method presented in the previous section makes it possible to find dead
parts of a query and remove them. As we will discuss in this section, removing
dead parts may cause other subqueries to become dead. That is, in some cases
some parts of a query may turn out to be dead after other dead parts of that
query have been removed. Let us start with an example. Suppose that we have a
view that returns lectures along with the professors who give them and the
faculties in which those professors work:
view LectProfFaculty
begin
  return Lecture ⋈ given_by.Professor ⋈ works_in.Faculty
end;

If we want to use this view to get the subjects of all lectures, we can ask the
query

LectProfFaculty.subject

After macro-substitution it has the form:

(Lecture ⋈ given_by.Professor ⋈ works_in.Faculty).subject

The only dead part is works_in.Faculty (it is the biggest dead part). The
subquery given_by.Professor is not used by the subquery subject, but it is not
dead since it is used by the dead subquery; thus it cannot be eliminated. After
removing the dead part the query has the form
(Lecture ⋈ given_by.Professor).subject

Note that now the part given_by.Professor is dead. The reason is that the
subquery that used its result has been removed. Therefore the method should
be performed in a loop: each iteration performs one modified static analysis of
the current form of the query, during which all dead parts are detected and
removed; the loop ends when no dead part is found.
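The loop can be sketched in Python as a fixpoint iteration. The model below is
an assumption made for illustration: each join operand is described only by the
names it provides and the names it uses, and one_pass stands in for one run of
the modified static analysis; neither function is taken from the authors'
implementation.

```python
def one_pass(operands, final_names):
    """One simplified analysis pass: drop operands that nothing refers to.

    Each operand is a triple (text, provides, uses). An operand is kept if
    some name it provides is used by a later operand or by the final
    projection of the query.
    """
    live = []
    for i, (text, provides, uses) in enumerate(operands):
        later_uses = set(final_names)
        for _, _, other_uses in operands[i + 1:]:
            later_uses |= other_uses
        if provides & later_uses:
            live.append((text, provides, uses))
    return live


def remove_all_dead(operands, final_names):
    """Repeat the pass until no dead part is found; return result and pass count."""
    passes = 0
    while True:
        passes += 1
        pruned = one_pass(operands, final_names)
        if len(pruned) == len(operands):
            return pruned, passes
        operands = pruned


# (Lecture ⋈ given_by.Professor ⋈ works_in.Faculty).subject
query = [
    ("Lecture", {"subject", "credits", "given_by"}, set()),
    ("given_by.Professor", {"name", "works_in"}, {"given_by"}),
    ("works_in.Faculty", {"fname"}, {"works_in"}),
]
result, passes = remove_all_dead(query, {"subject"})
print([text for text, _, _ in result], passes)
```

In this model the first pass removes works_in.Faculty, the second removes
given_by.Professor, and the third finds nothing to remove and ends the loop.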










6 Other Cases 

So far we have considered queries for which a dead subquery was a part of a join
subquery and the result of that dead subquery was used to constitute (a part
of) the top environment on ES. However, in the general case a dead subquery does
not have to satisfy those conditions. Let us analyze an example. Suppose we have
the view

view NamesOfProfFaculty
begin
  return Professor.(works_in.(name as n, Faculty.fname as f))
end;

which for each professor returns his/her name and the name of the faculty he/she
works in (those subresults are named n and f, respectively). If we want to get
professors' names via this view, we can ask the following query:

NamesOfProfFaculty.n

After macro-substitution it has the form:

(Professor.(works_in.(name as n, Faculty.fname as f))).n    (ii)

The subquery Faculty.fname as f is dead just like the dead subqueries considered
so far (it is the only dead part of query (ii)); we remove it:

(Professor.(works_in.(name as n))).n    (iii)

However, a subquery that was not dead in (ii) turns out to be dead in (iii): the
subquery works_in has died, because the only subquery that had used its result
(i.e., Faculty) has been removed as a component of the dead subquery.

Note that this dead subquery neither forms part of a join subquery, nor is its
result used to build the top environment on ES when the final dot operator (which
is the last operator whose right subquery could use the result of works_in) is
being evaluated. Such cases require further research so that our method could
transform (iii) to
(Professor.(name as n)).n

A final remark: in some cases where can cause a subquery to be dead. An
example is the case when the left subquery of where is a union of queries:

(Q1 union Q2) where Q3

It may turn out that the final result of the whole query does not contain the
result of Q1 (or Q2), which makes that subquery dead. This also needs further
research.

7 Conclusions 

In the paper we have discussed in detail a special query optimization technique 
based on detecting and removing dead parts of queries through rewriting a query 
into a semantically equivalent form without dead parts. The method is based 
on static analysis of a given query in order to gather all the information that 
we need to detect dead subqueries. The only internal form of a query being 
optimized is a syntax tree. 







In our research we have used the formal stack-based model, which allows us to
describe precisely the semantics of query languages for a general object-oriented
database model. In this setting we have described our approach to views and we
have explained how we make use of the powerful technique of query modification
(macro-substitution).

As we have pointed out, there are special cases of dead subqueries that are not
covered by our method. They require further research.



References 

1. Fernandez, M., Simeon, J., Wadler, P.: XML Query Languages: Experiences and
Exemplars. www-db.research.bell-labs.com/user/simeon/xquery.html. (2000).

2. Cattel, R., Barry, D.: Object Data Standard 3.0. Morgan Kaufmann. (2000).

3. Ott, N., Horlander, K.: Removing Redundant Join Operations in Queries Involving
Views. Information Systems 10(3). (1985) 279-288.

4. Plodzien, J.: Optimization Methods in Object Query Languages. Ph.D. Thesis.
Institute of Computer Science, Polish Academy of Sciences. (2000).

5. Plodzien, J., Kraken, A.: Object Query Optimization through Detecting
Independent Subqueries. Information Systems 25(8). (2000) 467-490.

6. Plodzien, J., Subieta, K.: Optimization of Object-Oriented Queries by Factoring
Out Independent Subqueries. Institute of Computer Science, Polish Academy of
Sciences. Report 889. (1999).

7. Plodzien, J., Subieta, K.: Static Analysis of Queries as a Tool for Static
Optimization. Proc. of IDEAS, to appear (2001).

8. Stonebraker, M.: Implementation of Integrity Constraints and Views by Query
Modification. Proc. of SIGMOD. (1975) 65-78.

9. Subieta, K.: Mapping Heterogeneous Ontologies through Object Views. Proc. of
3rd Workshop on Engineering Federated Information Systems. (2000) 1-10.

10. Subieta, K., Beeri, C., Matthes, F., Schmidt, J.: A Stack-Based Approach to Query
Languages. Proc. of 2nd Intl. East-West Database Workshop. (1995) 159-180.

11. Subieta, K., Kambayashi, Y., Leszczylowski, J.: Procedures in Object-Oriented
Query Languages. Proc. of VLDB. (1995) 182-193.

12. Subieta, K., Plodzien, J.: Object Views and Query Modification. In: Barzdins,
J., Caplinskas, A. (eds.): Databases and Information Systems. Kluwer Academic
Publishers. (2001) 3-14.




The Impact of Buffering on Closest Pairs
Queries Using R-Trees

Antonio Corral¹, Michael Vassilakopoulos², and Yannis Manolopoulos³*

¹ Department of Languages and Computation
University of Almeria, 04120 Almeria, Spain
acorral@ual.es

² Lab of Data Engineering, Department of Informatics
Aristotle University, 54006 Thessaloniki, Greece
mvass@computer.org

³ Department of Computer Science
University of Cyprus, 1678 Nicosia, Cyprus
manolopo@ucy.ac.cy



Abstract. In this paper, the most appropriate buffer structure, page
replacement policy and buffering scheme for closest pairs queries, where
both spatial datasets are stored in R-trees, are investigated. Three buffer
structures (i.e. single, hybrid and by levels) over two buffering schemes
(i.e. local to each R-tree, and global to the query) using several page
replacement algorithms (e.g. FIFO, LRU, 2Q, etc.) are studied. In order
to answer K closest pair queries (K-CPQs, with K ≥ 1) we employ
recursive and non-recursive (iterative) branch-and-bound algorithms. The
outcome of this study is the derivation of the best-performing configuration
(in terms of buffer structure, page replacement algorithm and buffering
scheme) for CPQs. In all cases, the savings in disk accesses are larger for
a recursive algorithm than for a non-recursive one, in the presence of
buffer space. Also, the global buffering scheme is more appropriate for
small or medium buffer sizes for recursive algorithms, whereas the
local scheme is the best choice for large buffers. If we use non-recursive
algorithms, the global buffering scheme is the best choice in all cases.
Moreover, LRU is the most appropriate page replacement algorithm for
small or medium buffer sizes for both types of branch-and-bound
algorithms. FIFO and LRU are the best choices for recursive algorithms and
2Q for the non-recursive ones, when the buffer is large enough.



1 Introduction 

The use of buffers is very important in DBMSs, since it can improve the
performance substantially (reading data from the disk is significantly more
expensive than reading from a main memory buffer). There exist two basic research
directions that aim at reducing the disk I/O activity and enhancing the system
throughput during query processing using buffers. The first one focuses on the
availability of buffer pages at runtime by adapting memory management
techniques for buffer managers used in operating systems to database systems
[1,9,13,15]. The second one focuses on query access patterns, where the query
optimizer dictates the query execution plan to the buffer manager, so that the
latter can allocate and manage its buffer accordingly [4,6,20].

* On sabbatical leave from the Department of Informatics, Aristotle University, 54006
Thessaloniki, Greece, manolopo@csd.auth.gr

A. Caplinskas and J. Eder (Eds.): ADBIS 2001, LNCS 2151, pp. 41-54, 2001.
(c) Springer-Verlag Berlin Heidelberg 2001

Spatial selections, nearest neighbor searches and joins are considered
the most important queries in spatial databases that are based on R-trees.
R-trees [10] are multi-dimensional, height balanced tree structures for secondary
storage that handle objects by means of their Minimum Bounding Rectangles
(MBRs). In [5] a new kind of spatial query, called K closest pairs query (K-CPQ),
is presented. It combines join and nearest neighbor queries for discovering the
K pairs (K ≥ 1) of spatial objects from two datasets that have the K smallest
distances between them (1-CPQ is treated as a special case). Like a join query,
all pairs of objects are candidates for the result. Like a nearest neighbor query,
proximity metrics are the basis for pruning strategies and the final ordering.

The main objective of this work is to find the most appropriate buffer
structure, page replacement policy and buffering scheme for CPQs, where both
spatial datasets are indexed with R-trees. Based on experimental results, we draw
conclusions about the importance of using an appropriate buffer management for
the I/O performance of this kind of query. We present a comparative study, where
several parameters (such as the buffer structure, page replacement algorithms,
buffering schemes, buffer size in pages, number of pairs in the result K and the
nature of indexed datasets) and corresponding values are considered.

The rest of this paper is organized as follows. In Sect. 2, we review the
literature (CPQs using R-trees and buffering) and motivate the research topic
under consideration. In Sect. 3, a brief description of the spatial access method
(i.e. R-tree) and the branch-and-bound algorithms (i.e. recursive and
non-recursive) for satisfying CPQs are presented. In Sect. 4, in order to study
the effect of buffering on the performance of this kind of query, we examine
combinations of buffer structures, page replacement algorithms and buffering
schemes. Moreover, in Sect. 5, an extensive comparative performance study of CPQ
algorithms over these alternative combinations is presented. Finally, in the last
section, the conclusions on the contribution of this paper and future research
plans are summarized.



2 Related Work and Motivation 

In DBMSs, the buffer manager is responsible for operations in the buffer pool,
including buffer space assignment to queries, replacement decisions and buffer
reads and writes in the event of page faults. When buffer space is available, the
manager decides about the number of pages that are allocated to an activated
query. This decision may depend on the availability of pages at runtime (page
replacement algorithms), or the access pattern of queries (nature of the query).
A number of studies focus on adapting memory management techniques used
in operating systems to database systems, such as FIFO, LRU, LFU, Gclock,
etc. [1,9,13,15]. Other research efforts aim at determining the buffer requirements
of queries based on their access patterns (the nature of the query) without
considering the availability of buffer pages at runtime [4,6,20].

Since this paper is related to the research directions based on the nature of the
query, we focus on the most representative papers about buffer management
on indices. In [18], an LRU buffer structure for indices was presented (OLRU),
where the addressing space is logically partitioned into L independent regions,
each managed by a local LRU chain. In [6] an extensible and dynamic
priority-based hint mechanism was proposed to design an optimal replacement
strategy by exploiting the predictable access pattern of indexing methods. An
application of their hint mechanism was to design a hybrid replacement strategy,
combining the LRU and MRU page replacement policies. There are several studies
on spatial queries involving more than one R-tree, and most of them examine the
use of buffering to reduce the I/O activity [3,5,7,11,12,17].

All the previous papers involved more than one R-tree for the query and used
a buffer pool with LRU or FIFO replacement policy, but they did not justify the
use of these policies. In other words, they did not examine several alternatives for
the buffer structure, or for the page replacement strategies in order to reduce the
disk activity. In this paper, our objective is to find the most appropriate buffer
pool structure (i.e. single, hybrid and by levels) over two buffering schemes (i.e.
local and global) and the best page replacement policy (e.g. FIFO, LRU, Gclock,
etc.) for CPQs, where both spatial datasets are indexed by R-trees.

3 R-Trees and Algorithms for Closest Pairs Queries

3.1 R-Trees

R-trees [10] are hierarchical, height balanced data structures based on B+-trees,
used for the dynamic organization of k-dimensional geometric objects that are
represented by k-dimensional MBRs. R-trees obey the following rules. Leaves
reside on the same level and contain pairs of the form (R, O), where R is the
MBR containing (spatially) the object determined by the identifier O. Internal
nodes contain pairs of the form (R, P), where P is a pointer to a child of the
node and R is the MBR containing (spatially) the rectangles stored in this child.
Also, internal nodes correspond to MBRs containing (spatially) the MBRs of their
children. An R-tree of class (m, M) has the characteristic that every node, except
possibly for the root, contains between m and M pairs, where $m \le \lceil M/2 \rceil$.
If the root is not a leaf, it contains at least two pairs. Figure 1 depicts some
rectangles on the right and the corresponding R-tree on the left. Dotted lines
denote the bounding rectangles of the subtrees that are rooted in inner nodes.
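The occupancy rules of an R-tree of class (m, M) can be written down as a
small check. The Python sketch below is an illustration only; the function
names are hypothetical, and it assumes (beyond what is stated above) that the
capacity M also applies to the root.

```python
import math


def is_valid_class(m, M):
    """Class (m, M) requires m <= ceil(M / 2)."""
    return m <= math.ceil(M / 2)


def has_valid_occupancy(num_pairs, m, M, is_root=False, is_leaf=False):
    """Check the number of pairs stored in one node.

    Assumption: no node, including the root, may exceed M pairs.
    """
    if num_pairs > M:
        return False
    if is_root:
        # A root that is not a leaf must contain at least two pairs.
        return is_leaf or num_pairs >= 2
    return num_pairs >= m


print(is_valid_class(2, 4), has_valid_occupancy(3, 2, 4))
```

For example, in class (2, 4) a non-root node with a single pair violates the
lower bound, whereas a root that is also a leaf may hold a single pair.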

Many R-tree variants have appeared in the literature. One of the most
popular variations is the R*-tree [2], which follows a sophisticated node split
technique and is considered to be the most efficient variant of the R-tree family.
In this paper, we have chosen R*-trees to perform our experimental study,
although in the sequel, the terms R-tree and R*-tree will be used interchangeably.









Fig. 1. An example of an R-tree 



3.2 Algorithms for Closest Pairs Queries 

A new spatial query was presented in [5], called K closest pairs query (K-CPQ).
It combines join and nearest neighbor queries for discovering the K pairs (K ≥ 1)
of spatial objects from two datasets that have the K smallest distances between
them. These queries are defined as follows.

1-CPQ. Assume two object datasets P and Q (where $P \neq \emptyset$,
$Q \neq \emptyset$), stored in two R-trees, $R_P$ and $R_Q$, respectively. Find the
pair of objects $p \in P \times Q$ such that
$dist(p) \le dist(p'), \forall p' \in (P \times Q - \{p\})$, where dist is a Minkowski
distance of the pairs of $P \times Q$.

K-CPQ. Assume two object datasets P and Q (where $P \neq \emptyset$,
$Q \neq \emptyset$), stored in two R-trees, $R_P$ and $R_Q$, respectively. Find the K
ordered pairs of objects $p_1, p_2, \ldots, p_K$, $p_i \in P \times Q$, such that
$dist(p_1) \le dist(p_2) \le \ldots \le dist(p_K) \le dist(p'), \forall p' \in (P \times Q - \{p_1, p_2, \ldots, p_K\})$.
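To make the definition concrete, the following Python sketch answers a K-CPQ by
brute force, examining every pair of P × Q. It is not one of the
branch-and-bound algorithms studied in this paper and uses no R-trees; the
function name k_closest_pairs is hypothetical, and the Euclidean distance is
assumed as the Minkowski distance.

```python
import heapq
import math
from itertools import product


def k_closest_pairs(P, Q, K):
    """Return the K pairs (p, q) of P x Q with the smallest Euclidean distances,
    ordered by increasing distance."""
    return heapq.nsmallest(
        K, product(P, Q), key=lambda pair: math.dist(pair[0], pair[1]))


P = [(0, 0), (5, 5)]
Q = [(1, 0), (5, 7), (9, 9)]
closest = k_closest_pairs(P, Q, 2)
print(closest)
```

Such an exhaustive scan costs |P| · |Q| distance computations, which is the
work that the pruning metrics described next are meant to avoid.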

Metrics (MINIMINDIST, MINMAXDIST and MAXMAXDIST) and prop- 
erties between two MBRs in the fc-dimensional Euclidean space were proposed 
for the 1-CPQ and A"-CPQ in [5] as bounds for the branch-and-bound (recursive 
and non-recursive) algorithms. The recursive branch-and-bound algorithm (with 
a synchronous traversal, following a depth-first search strategy) for processing 
the 1-CPQ between two sets of points stored in two R.-trees with the same height 
can be described by the following steps: 

CPQ1. Start from the roots of the two R-trees and set the minimum distance 
found so far, T, to ∞. 

CPQ2. If you access a pair of internal nodes, then calculate the minimum 
of MINMAXDIST for all possible pairs of MBRs. If this minimum is smaller 
than T, then update T. Calculate MINMINDIST for each possible pair of MBRs. 
Propagate downwards recursively only for those pairs having MINMINDIST < T. 

CPQ3. If you access two leaves, then calculate the distance of each possible 
pair of points. If this distance is smaller than T, then update T. 
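
The recursive steps above can be illustrated with a short Python sketch. It is 
a simplified illustration under stated assumptions, not the implementation 
evaluated in this paper: the `Node` class, `make_leaf`, `make_internal` and 
`cpq_recursive` are hypothetical names, the trees are two-dimensional and held 
in memory, and the MINMAXDIST-based tightening of T in step CPQ2 is omitted, so 
pruning relies on MINMINDIST only. 

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical in-memory R-tree node; a leaf holds points, an internal node holds children."""
    mbr: tuple                                     # ((xmin, ymin), (xmax, ymax))
    children: list = field(default_factory=list)   # child nodes (internal node only)
    points: list = field(default_factory=list)     # (x, y) tuples (leaf only)


def make_leaf(points):
    xs, ys = zip(*points)
    return Node(((min(xs), min(ys)), (max(xs), max(ys))), points=list(points))


def make_internal(children):
    lo = tuple(min(c.mbr[0][d] for c in children) for d in range(2))
    hi = tuple(max(c.mbr[1][d] for c in children) for d in range(2))
    return Node((lo, hi), children=list(children))


def minmindist(a, b):
    """Smallest possible Euclidean distance between a point in MBR a and a point in MBR b."""
    gaps = (max(a[0][d] - b[1][d], b[0][d] - a[1][d], 0.0) for d in range(2))
    return math.sqrt(sum(g * g for g in gaps))


def cpq_recursive(node_p, node_q, best=(math.inf, None)):
    """Return (T, (p, q)) for the closest pair below node_p and node_q (trees of equal height)."""
    if not node_p.children and not node_q.children:        # CPQ3: two leaves
        for p in node_p.points:
            for q in node_q.points:
                d = math.dist(p, q)
                if d < best[0]:
                    best = (d, (p, q))
        return best
    for child_p in node_p.children:                        # CPQ2: two internal nodes
        for child_q in node_q.children:
            # Descend only into pairs that can still improve on T (MINMINDIST < T).
            if minmindist(child_p.mbr, child_q.mbr) < best[0]:
                best = cpq_recursive(child_p, child_q, best)
    return best
```

Calling `cpq_recursive(root_p, root_q)` on two roots built with `make_internal` 
corresponds to step CPQ1, since the default value of `best` sets T to infinity. 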

The non-recursive branch-and-bound algorithm (with a synchronous traversal, 
following a best-first search strategy using a minimum heap) for processing 
the 1-CPQ between two sets of points stored in two R-trees with the same height 
can be described by the following steps: 





The Impact of Buffering on Closest Pairs Queries Using R- Trees 






CPQ1. Start from the roots of the two R-trees, set T to ∞ and initialize 
the minimum heap. 

CPQ2. If you access a pair of internal nodes, then calculate the minimum of 
MINMAXDIST for all possible pairs of MBRs. If this minimum is smaller than 
T, then update T. Calculate MINMINDIST for each possible pair of MBRs. 
Insert into the minimum heap those pairs having MINMINDIST < T. 

CPQ3. If you access two leaves, then calculate the distance of each possible 
pair of points. If this distance is smaller than T, then update T. 

CPQ4. If the minimum heap is empty, then stop. 

CPQ5. Get the pair on top of the minimum heap. If this pair has 
MINMINDIST > T, then stop. Else, repeat the algorithm from CPQ2 for this pair. 
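
The best-first variant can be sketched in the same spirit. Again, this is only 
an illustration under assumptions: `Node` and `cpq_best_first` are hypothetical 
names, the data are two-dimensional, the MINMAXDIST update of T is left out, 
and the root pair is simply pushed onto the heap first, which has the same 
effect as processing it directly in step CPQ2. 

```python
import heapq
import itertools
import math


class Node:
    """Hypothetical in-memory R-tree node (leaf: points, internal node: children)."""

    def __init__(self, children=(), points=()):
        self.children, self.points = list(children), list(points)
        boxes = [c.mbr for c in self.children] or [(p, p) for p in self.points]
        self.mbr = (tuple(min(b[0][d] for b in boxes) for d in range(2)),
                    tuple(max(b[1][d] for b in boxes) for d in range(2)))


def minmindist(a, b):
    """Smallest possible Euclidean distance between a point in MBR a and a point in MBR b."""
    return math.sqrt(sum(max(a[0][d] - b[1][d], b[0][d] - a[1][d], 0.0) ** 2
                         for d in range(2)))


def cpq_best_first(root_p, root_q):
    """Return (T, (p, q)) for the closest pair of two trees of equal height."""
    T, best = math.inf, None                               # CPQ1
    order = itertools.count()   # tie-breaker so heap entries never compare Node objects
    heap = [(minmindist(root_p.mbr, root_q.mbr), next(order), root_p, root_q)]
    while heap:                                            # CPQ4: stop on an empty heap
        mind, _, node_p, node_q = heapq.heappop(heap)      # CPQ5: most promising pair
        if mind > T:
            break
        if node_p.points and node_q.points:                # CPQ3: two leaves
            for p in node_p.points:
                for q in node_q.points:
                    d = math.dist(p, q)
                    if d < T:
                        T, best = d, (p, q)
        else:                                              # CPQ2: two internal nodes
            for child_p in node_p.children:
                for child_q in node_q.children:
                    d = minmindist(child_p.mbr, child_q.mbr)
                    if d < T:
                        heapq.heappush(heap, (d, next(order), child_p, child_q))
    return T, best
```

Because pairs leave the heap in increasing MINMINDIST order, the loop can stop 
as soon as the pair on top of the heap cannot contain anything closer than T. 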

The pseudo-code of the recursive and non-recursive algorithms can be found 
in the technical report [8]. Moreover, in order to process the K-CPQ, an extra 
structure that holds the K closest pairs is necessary. More details can be found 
in [5]. 
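
One plausible form of such an extra structure is a bounded max-heap, sketched 
below. This is an assumption made for illustration (the class 
`KClosestPairs` and its methods are hypothetical names); the structure actually 
used is described in [5]. The sketch keeps the K smallest distances seen so far 
and exposes the K-th smallest one, which can play the role of the pruning 
distance T once K pairs have been collected. 

```python
import heapq


class KClosestPairs:
    """Keeps the K pairs with the smallest distances offered so far (illustrative sketch)."""

    def __init__(self, k):
        self.k = k
        self._heap = []   # entries are (-distance, pair); the root is the farthest kept pair

    def threshold(self):
        """Infinity until K pairs are held, then the K-th smallest distance."""
        return float("inf") if len(self._heap) < self.k else -self._heap[0][0]

    def offer(self, distance, pair):
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, (-distance, pair))
        elif distance < -self._heap[0][0]:
            # The new pair is closer than the farthest kept pair: swap them.
            heapq.heapreplace(self._heap, (-distance, pair))

    def result(self):
        """Return the kept (distance, pair) entries in increasing distance order."""
        return sorted((-d, pair) for d, pair in self._heap)
```

Pairs with equal distances are compared directly when ordering entries, so the 
pair objects are assumed to be mutually comparable (for example, tuples of 
coordinates). 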

4 Buffer Management 

DBMSs use indices to speed up query processing (e.g. various spatial databases 
use R-trees). Indices may partly reside in main memory buffers. This reduces 
response times. The buffering effect should be studied, since even a small 
number of buffer pages can substantially improve the global database 
performance. Our objective is to find the best structure of the buffer pool, 
the best page replacement algorithm and the best buffering scheme for the 
buffer manager in order to reduce the number of disk accesses for K-CPQs. We 
propose three structures of the buffer pool (i.e. single, hybrid and by levels) 
managed by a variety of page replacement algorithms (e.g. FIFO, LRU, etc.). 

The buffer pool is organized according to one of two buffering schemes, as 
depicted in Fig. 2. In the first scheme, the buffer pool is split into two 
parts, each one allocated locally to an R-tree (left part of Fig. 2); we 
therefore call it the Local buffering scheme. In the second scheme, the buffer 
pool is allocated globally to the query (right part of Fig. 2), giving rise to 
the Global buffering scheme. 
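
The difference between the two schemes can be made concrete with a small 
simulation sketch. It assumes an LRU policy and a hypothetical trace of 
(tree, page) references; `LRUBuffer`, `run_local` and `run_global` are 
illustrative names and not part of the system studied here. 

```python
from collections import OrderedDict


class LRUBuffer:
    """A pool of `capacity` pages with LRU replacement that counts disk accesses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()   # least recently used page first
        self.disk_accesses = 0

    def read(self, page_id):
        if page_id in self.pages:            # buffer hit
            self.pages.move_to_end(page_id)
            return
        self.disk_accesses += 1              # buffer fault: fetch from disk
        if self.capacity == 0:
            return
        if len(self.pages) >= self.capacity:
            self.pages.popitem(last=False)   # evict the least recently used page
        self.pages[page_id] = True


def run_local(trace, B):
    """Local scheme: each R-tree gets its own pool of B/2 pages."""
    pools = {"P": LRUBuffer(B // 2), "Q": LRUBuffer(B // 2)}
    for tree, page in trace:
        pools[tree].read(page)
    return sum(pool.disk_accesses for pool in pools.values())


def run_global(trace, B):
    """Global scheme: one pool of B pages is shared by both R-trees."""
    pool = LRUBuffer(B)
    for tree, page in trace:
        pool.read((tree, page))
    return pool.disk_accesses
```

On a toy trace that touches one tree much more than the other, the shared pool 
can absorb the imbalance, whereas the local pools cannot lend pages to each 
other. This only illustrates the mechanism; it says nothing about which scheme 
wins on real K-CPQ workloads. 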

In [9] a systematic description of replacement algorithms was presented for a 
single buffer structure. The FIFO (First-In First-Out) algorithm replaces the 
oldest page in the buffer regardless of its reference frequency, thus giving 
priority to the youngest page. The LFU (Least Frequently Used) scheme replaces 
the page with the lowest reference frequency. Gclock circularly decrements the 
reference counters of the buffered pages until a counter reaches 0; when a 
buffer fault occurs, the first page having a counter equal to 0 is replaced. 
The LRU (Least Recently Used) algorithm gives priority to the most recently 
used pages, replacing the page that was least recently used. MRU (Most Recently 
Used) is the opposite of LRU and replaces the page that was most recently used. 
LRU/2 is a particular case of LRU/K, proposed in [15], for K = 2; it replaces 
the page whose penultimate (second-to-last) access is the least recent among 
all penultimate accesses. LRD 







(Least Reference Density) is a page replacement algorithm based not on page 
age, but on reference density (reference probability), measured from the first 
time that a page was accessed. LRD evicts from the buffer the page with the 
minimum reference density. Finally, in [16] a page replacement algorithm for 
spatial databases, called LRD-Manhattan, was proposed as a variation of LRD. 
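
To make the differences between the simplest of these policies concrete, the 
following sketch counts buffer faults for FIFO, LRU and MRU over a page 
reference trace. The function name `count_faults` and the trace in the usage 
note are hypothetical; the other policies (LFU, Gclock, LRU/2, LRD) are not 
covered by this sketch. 

```python
def count_faults(trace, capacity, policy):
    """Count buffer faults for the 'FIFO', 'LRU' or 'MRU' policy over a page trace."""
    loaded_at, used_at = {}, {}   # page -> time it entered the buffer / was last referenced
    faults = 0
    for now, page in enumerate(trace):
        if page not in loaded_at:
            faults += 1
            if capacity == 0:
                continue                                     # nothing can be buffered
            if len(loaded_at) >= capacity:
                if policy == "FIFO":
                    victim = min(loaded_at, key=loaded_at.get)   # oldest page
                elif policy == "LRU":
                    victim = min(used_at, key=used_at.get)       # least recently used
                elif policy == "MRU":
                    victim = max(used_at, key=used_at.get)       # most recently used
                else:
                    raise ValueError(f"unknown policy: {policy}")
                del loaded_at[victim], used_at[victim]
            loaded_at[page] = now
        used_at[page] = now
    return faults
```

For the made-up trace `[1, 2, 3, 1, 4, 1, 2]` and a buffer of 3 pages, FIFO 
incurs 6 faults while LRU and MRU incur 5 each; different traces can of course 
rank the policies differently. 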





Fig. 2. Local and Global buffering schemes 



The most representative methods for the hybrid buffer structure are the 
techniques called 2Q and FIFO_LRU. The 2Q algorithm divides the buffer pool 
into two areas: the hot area, managed as an LRU queue, and the cold area, 
maintained as a FIFO queue [13]. On the first reference to a page, 2Q places it 
in the cold area (FIFO). If the page is re-referenced while in the cold area, 
then it is moved to the hot area (LRU). If a page is not re-referenced while in 
the cold area, it is eventually evicted from the buffer. In order to solve the 
“correlated references” problem, 2Q divides the cold area into two parts, one 
for pages and another for page identifiers. The FIFO_LRU technique works in the 
same way as 2Q, but the hot area is managed with a FIFO policy and the cold 
area with an LRU policy [1]. 
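
The basic hot/cold mechanism can be sketched as follows. This is a deliberately 
simplified version that follows only the description above: it omits the 
separate queue of page identifiers that full 2Q uses against correlated 
references, it assumes both areas hold at least one page, and the class name 
`SimpleTwoQueue` is hypothetical. 

```python
from collections import OrderedDict, deque


class SimpleTwoQueue:
    """Simplified hot (LRU) / cold (FIFO) buffer that counts buffer faults."""

    def __init__(self, hot_size, cold_size):
        self.hot_size, self.cold_size = hot_size, cold_size
        self.hot = OrderedDict()   # LRU order: least recently used page first
        self.cold = deque()        # FIFO order: oldest page first
        self.faults = 0

    def read(self, page):
        if page in self.hot:                       # hit in the hot area
            self.hot.move_to_end(page)
        elif page in self.cold:                    # re-referenced while cold: promote
            self.cold.remove(page)
            if len(self.hot) >= self.hot_size:
                self.hot.popitem(last=False)       # evict the LRU page of the hot area
            self.hot[page] = True
        else:                                      # first reference: goes to the cold area
            self.faults += 1
            if len(self.cold) >= self.cold_size:
                self.cold.popleft()                # evict the oldest cold page
            self.cold.append(page)
```

Swapping the two container roles (a FIFO hot area and an LRU cold area) would 
give a sketch of FIFO_LRU along the same lines. 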

Here, we present a buffer structure linked to each R-tree based on its height, 
h, for solving K-CPQs. This means that the buffer pool is split into h 
independent areas. For each R-tree level we allocate a number of pages 
according to its minimum fan-out factor m and its height, with the exception of 
the root, for which we allocate only one page. We create this buffer structure 
in a bottom-up way, trying to set a distribution of pages per level as fair as 
possible (root level = level $h-1$: $m^0$; level $h-2$: $m^1$; level $h-3$: 
$m^2$; ...; level 1: $m^{h-2}$; leaf level = level 0: $m^{h-1}$). In the case 
of K-CPQs, pages at lower levels are very important for the branch-and-bound 
algorithms. Besides, we manage these h independent areas using a specific page 
replacement algorithm, for example LRU (LRU_L = LRU by levels) or FIFO 
(FIFO_L = FIFO by levels). 
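
As a worked example of this distribution, the sketch below computes the page 
count per level from m and h. The function name `pages_per_level` is 
hypothetical, and the sketch only evaluates the stated powers of m; how the 
counts are adjusted when the buffer has fewer pages than their sum is not 
specified here and is therefore not modelled. 

```python
def pages_per_level(m, h):
    """Return {level: pages}: the root (level h-1) gets m**0, the leaves (level 0) get m**(h-1)."""
    return {level: m ** (h - 1 - level) for level in range(h - 1, -1, -1)}
```

For instance, with m = 7 and an assumed height h = 4, the levels 3, 2, 1 and 0 
receive 1, 7, 49 and 343 pages, respectively, for a total of 400 pages. 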













5 Experimentation 

This section summarizes the results of an extensive experimentation that aims 
at measuring and evaluating the behavior of the recursive and non-recursive 
branch-and-bound algorithms for K-CPQs using different structures, schemes, 
policies and buffer sizes. We ignore the effect of the path-buffer [5], since 
it offers more advantages to the recursive algorithms, regardless of the page 
replacement policy. 

For our experiments, we have built several R*-trees [2] using the following 
datasets: (a) a real dataset from the Sequoia project [19] consisting of 62,536 
points that represent specific country sites of California (Real), (b) a point 
dataset produced from the real one by randomly moving every point (Real') and 
(c) two datasets of 62,536 points each, which completely overlap and follow 
uniform and skewed distributions [5]. All experiments were run on a Linux 
workstation with 128 MB of main memory and several GB of secondary storage, 
using the GNU C++ compiler. The page size was 1 KB, resulting in a maximum 
R*-tree node capacity M = 21 (minimum capacity was set to m = M/3 = 7, a 
reasonable choice according to [2]). The quantity counted in all experiments was 
the number of disk accesses required to perform the K-CPQs. 

5.1 K-CPQ Algorithms Using a Local Buffering Scheme 

We now proceed to the performance comparison of the recursive and non-recursive 
branch-and-bound algorithms for K-CPQs using a Local buffering scheme 
in order to investigate the best page replacement policy and buffer structure. 
We used a buffer pool, B, with varying size from 0 to 512 pages, dedicating 
different portions of B to each R*-tree. The datasets joined were Real/Real' and 
Uniform/Skewed. However, in the sequel we focus on the Real/Real' datasets, 
since both cases gave very similar trends. 

First of all, for the hybrid structure in the Local or Global buffering scheme, 
we have performed several experiments with different B values (B/2 for each 
R*-tree) using recursive and non-recursive algorithms to derive the best page 
distribution for the hot and cold regions in the buffer. If $B_P$ is the number 
of pages in the local buffer of the R*-tree $R_P$, the best configuration was 
<Hot, Cold> = <$B_P$/2, $B_P$/2>. Moreover, for the Local buffering scheme, we 
have assigned a varying number of pages to each R*-tree, and the best 
distribution of the buffer was to assign more pages to the largest R*-tree, 
whatever the type (recursive or non-recursive) of algorithm used. Since in our 
experimentation we have point datasets with identical cardinalities, (B/2, B/2) 
was the best configuration [8]. 

We have run experiments using different page replacement policies over the 
three buffer pool structures. The best policies for the recursive algorithms were 
FIFO and LRU in case of small buffers (e.g. B < 64), but in case of large buffers 
(e.g. B > 128) LRU_L was slightly better than FIFO and LRU. FIFO and LRU 
were better than LFU, Gclock, MRU, LRU/2 and LRD, because recursion favors 
the youngest and most recently used pages in the backtracking phase and this 







behavior is slightly improved in case of large buffers organized by levels 
(FIFO_L and LRU_L). On the other hand, for the non-recursive algorithms and 
small buffers (e.g. B < 64), FIFO and LRU were again the best policies, whereas 
for large buffers (e.g. B > 128) 2Q was slightly better than FIFO and LRU. In 
this case, no recursion is used, and the organization of the buffer pool in two 
regions (i.e. hot and cold) performed well when the search strategy was 
best-first, implemented through a minimum heap, and the buffer was large 
enough. For instance, for the recursive 1-CPQ, using a single buffer structure, 
MRU was 35% worse than LRU. Under these conditions, Gclock was 4% worse than 
LRU, LFU 35% worse than FIFO, LRU/2 20% worse than LRU, and LRD 32% worse than 
FIFO. These behaviors are depicted in Fig. 3, where different page replacement 
policies are compared, using the recursive algorithm for 1-CPQ in a single 
buffer structure. Moreover, with a large buffer (e.g. B = 512), the single 
structure and the LRU policy, the savings in I/O operations were 73% for the 
recursive algorithm and 68% for the non-recursive one with respect to the 
absence of buffer space (B = 0). For the non-recursive algorithm, the 
comparison among policies gave very similar results. 





Fig. 3. The performance of the 1-CPQ recursive algorithm for various page replacement 
policies and a single buffer structure, as a function of the buffer size 



In Fig. 4 we illustrate the performance of the 1-CPQ recursive (left) and 
non-recursive (right) algorithms for various page replacement policies, as a 
function of the buffer size. It can be seen that the two charts follow the same 
trend. When the buffer size is small (e.g. B < 64), the single structure with 
LRU policy is the best (with 6% and 5% savings for LRU in comparison with 2Q, 
for recursive and non-recursive algorithms, respectively), the second is the 
hybrid and the third one is by levels. However, in case of large buffers (e.g. 
B > 128) the difference is almost negligible for all page replacement policies, 
although LRU_L and 2Q are slightly better than the others for the recursive and 
non-recursive algorithms, respectively. 

The results of the recursive K-CPQ algorithm for a given buffer size (e.g. 
B = 512) showed that the best behavior was for LRU_L with a 0.5% improvement 
over LRU (for all K values), whereas the worst results appeared in the case of 
the hybrid structure (2Q and FIFO_LRU). For the non-recursive algorithm with 
the same number of buffer pages (B = 512), the best behavior was for 2Q with a 










Fig. 4. The performance of the 1-CPQ recursive (left) and non-recursive (right) 
algorithms for various page replacement policies, as a function of the buffer size 



0.6% improvement over LRU (for all K values), whereas the worst results were 
for FIFO_LRU [8]. 

In the case of 1-CPQ, the recursive algorithm presents a 10% excess of I/O 
activity in comparison to the non-recursive one with the same page replacement 
policy (LRU), as can be seen from the gap between the two lines in the left part 
of Fig. 5. The gap for K-CPQ grows as K increases; it is 25% bigger for 
K < 10000, but it reaches 45% when K = 100000 (see the right part of Fig. 5). 
Besides, as K increases (from 1 to 100000), the performance of the recursive 
algorithm is not significantly affected; with a buffer of 512 pages and the best 
page replacement algorithm there is an extra cost of 2%. On the other hand, this 
extra cost is about 39% for the non-recursive algorithm using the same buffer 
characteristics. If we do not have any buffer space (B = 0), then increasing K 
implies an additional cost of 33% for the recursive algorithm and 16% for the 
non-recursive one. Moreover, the recursive variant demonstrates savings in the 
range 73%-82% as K increases (from 1 to 100000) when a buffer of 512 pages is 
used, in comparison to the no-buffer case (B = 0). The non-recursive algorithm 
under the same buffer setup results in savings from 68% down to 57%. 
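
The savings percentages quoted in this section compare the number of disk 
accesses with a buffer against the no-buffer case (B = 0). A minimal sketch of 
that computation is given below; the function name `io_savings_percent` and the 
access counts in the usage note are hypothetical and are not measurements from 
the experiments. 

```python
def io_savings_percent(accesses_without_buffer, accesses_with_buffer):
    """Percentage of disk accesses saved relative to running with no buffer (B = 0)."""
    saved = accesses_without_buffer - accesses_with_buffer
    return 100.0 * saved / accesses_without_buffer
```

For example, a query that needs 1000 disk accesses without a buffer and 270 
with one would show savings of 73%. 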





[Fig. 5 plot omitted; recovered axis label: Number of Pairs] 

Fig. 5. The performance of the 1-CPQ (left) and the K-CPQ (right) recursive (REC) 
and non-recursive (NREC) algorithms using the best page replacement policies and 
B = 512, as a function of the buffer size 



In Fig. 6, the percentage of I/O cost savings (induced by the use of buffer size 
B > 0 in contrast to not using any buffer) of the K-CPQ recursive algorithm 
with LRU_L policy (left) and non-recursive algorithm with 2Q policy (right) is 







depicted. For the recursive algorithm, the percentage of savings grows as the 
buffer size increases, for all K values, although it is bigger for K = 100000. 
The behavior of the non-recursive algorithm is slightly different. When the 
buffer becomes larger, the percentage of savings also increases, but when we fix 
the buffer size, increasing K causes a decrease in the percentage of savings. 
From all these results, we notice that the influence of buffering for a Local 
scheme is more important for the recursive algorithm than for the non-recursive 
one. 




Fig. 6. The I/O cost savings of the K-CPQ recursive algorithm with LRU_L policy 
(left) and non-recursive algorithm with 2Q policy (right), as a function of B and the 
cardinalities of the data sets 



5.2 K-CPQ Algorithms Using a Global Buffering Scheme 

For the Global buffering scheme, we have used the same parameters as for the 
Local one in order to investigate the best page replacement policy and buffer 
structure. In particular we used: (a) several replacement algorithms (FIFO, LRU, 
LRU_L, FIFO_L, 2Q and FIFO_LRU) for the three buffer structures, (b) the same 
number of pages for the buffer (B varying from 0 to 512 pages), and (c) the 
recursive and non-recursive algorithms for K-CPQ with K varying from 1 to 
100000. 

We have performed experiments with 1-CPQ using several replacement al- 
gorithms in the Global buffering scheme. When the buffer size was small or 
medium (e.g. B < 128), the single structure with LRU policy was the best (with 
3% savings with respect to 2Q, for recursive and non-recursive algorithms), the 
second was the hybrid and the third one was by levels. Again, when the buffer 
was large (e.g. B > 256) the difference was almost negligible for all page replace- 
ment policies, although FIFO and 2Q were slightly better than the other ones 
for the recursive and non-recursive algorithms, respectively [8]. 

In the left part of Fig. 7, we depict the performance of the recursive K-CPQ 
algorithm for a given buffer size (e.g. B = 512). The best behavior is for FIFO 
with savings of 0.6% in relation to LRU (for all K values), and the worst 
results are again for the hybrid structure (2Q and FIFO_LRU). On the other 






hand, the results of the non-recursive K-CPQ algorithm are illustrated in the 
right part of Fig. 7 for the same buffer size (B = 512). The best behavior arises 
for 2Q with savings of 0.6% in relation to LRU (for all K values), and the worst 
results are for FIFO_LRU. 

For 1-CPQ, the buffering increased the performance of the recursive algorithm 
by 9% in comparison to the non-recursive one with the same page replacement 
policy (LRU). For K-CPQ, as K increased, this improvement was approximately 26% 
for K < 10000 and 47% for K = 100000. Besides, for increasing K values, the I/O 
cost of the recursive algorithm was not significantly affected: with a buffer of 
512 pages and the best page replacement algorithm, the extra cost was only 2%. 
On the other hand, this extra cost was about 39% for the non-recursive algorithm 
using the same buffer characteristics. Moreover, the recursive variants 
demonstrated savings in the range 73%-81% as K increased, comparing the case of 
a 512-page buffer with the case of no buffer at all (B = 0). The non-recursive 
algorithm, under the same buffer setup, resulted in 68%-57% savings. In general, 
these results were very similar to those of the Local buffering scheme [8]. 





Fig. 7. The performance of the K-CPQ algorithm for different page replacement 
policies as a function of the buffer size for recursive (left) and non-recursive 
(right) algorithms and B = 512 pages 



In Fig. 8, we see the performance of the K-CPQ recursive and non-recursive 
algorithms as a function of buffer size (B > 0) with LRU policy. For the 
recursive algorithm, when B > 32, the savings in terms of disk accesses are 
large and almost the same for all K values. However, the savings are 
considerably smaller when B < 16, whereas for K = 100000 and B = 0 we can 
notice a characteristic peak. For the non-recursive algorithm, the savings trend 
is similar to that of the recursive one, but for high K values these savings 
become considerably smaller than those of the recursive one. For instance, if 
enough buffer space is available, the recursive algorithm is the best 
alternative, because it provides average I/O savings of 20% with respect to the 
non-recursive one for K-CPQ using LRU. From all these results, we notice that 
the influence of buffering for a Global scheme is more important for the 
recursive algorithm than for the non-recursive one in K-CPQs, when we have 
enough buffer space. This is the same conclusion as for the Local buffering 
scheme. 









Fig. 8. The performance of K-CPQ recursive (left) and non-recursive (right) algorithms 
with LRU policy, as a function of the buffer size and the cardinality of the data sets 

5.3 Comparison of the Buffering Schemes for K-CPQ 

Table 1 contains the results of an exhaustive comparison of the Local and Global 
buffering schemes, using the best buffer structure and page replacement 
algorithms for each of them. These results concern the performance of K-CPQs 
(K ≥ 1) using the recursive (REC) and non-recursive (NREC) algorithms. 



Table 1. Comparison of the Local and Global buffering schemes 



Buffer Size   8         16        32        64        128       256        512 
REC           G (LRU)   G (LRU)   L (LRU)   G (LRU)   G (LRU)   G (FIFO)   L (LRU_L) 
NREC          G (LRU)   G (LRU)   G (LRU)   G (LRU)   G (2Q)    G (2Q)     G (2Q) 


From this table (where L and G stand for Local and Global, respectively), we 
deduce that the Global buffering scheme is the best alternative in most cases, 
except for B = 32 and B = 512 for the recursive algorithm, where the Local 
scheme prevails. The difference between the Global and Local schemes is around 
1%-2% in terms of disk accesses for all cases. Since the difference is small, we 
suggest using the Global buffering scheme, because, in this case, the buffer 
manager may: (a) include and handle more than two R-trees in the same buffer 
area, (b) give priority to a specific R-tree, (c) manage and assign dynamically 
more pages to one R-tree and (d) introduce global optimization techniques. 

Besides, LRU is the most appropriate page replacement algorithm with a single 
buffer structure when the buffer size is small or medium. On the other hand, 
when the buffer is large the best alternatives are FIFO (single structure) and 
LRU_L (structure by levels) for the recursive algorithm and 2Q (hybrid 
structure) for the non-recursive one. Since the difference between LRU and the 
other winning page replacement algorithms (FIFO, LRU_L and 2Q) is in the range 
1%-2%, we suggest using LRU as the policy with the best overall stable 
performance. 






6 Conclusions and Future Work 

Efficient processing of closest pairs queries (K-CPQs with K ≥ 1) is of great 
importance in a wide area of applications like spatial databases, GIS, image 
databases, etc. Buffering is very important in DBMSs, because it improves the 
performance considerably (since reading from disk is orders of magnitude more 
expensive than reading from a buffer). In this paper we have examined the most 
important factors that affect the performance in the presence of a buffer. These 
are: the buffer structure, the page replacement algorithm, and the buffering 
scheme. From the experimentation we draw the following conclusions: 

- The I/O savings for the recursive algorithm are larger than those of the 
non-recursive one for K-CPQ when we have enough buffer space. The reason 
is that the use of recursion in a depth-first way is affected by the 
buffering scheme more than a best-first search strategy implemented 
through a minimum heap. 

- With a fixed buffer size, increasing the number K of pairs in a CPQ 
results in a negligible extra cost for the recursive algorithm compared to 
the additional cost for the non-recursive one. 

- The Global buffering scheme is more appropriate when the buffer size is 
small or medium for the recursive algorithm, while the Local scheme is the 
best choice for large buffers. On the other hand, if we use the non-recursive 
algorithm, the Global buffering scheme is the best alternative for all cases. 

- LRU is the most appropriate page replacement algorithm with a single buffer 
structure when the buffer size is small or medium, whatever the type 
(recursive or non-recursive) of algorithm for K-CPQs. On the other hand, when 
the buffer is large, the best alternatives are FIFO (single structure) 
and LRU_L (structure by levels) for the recursive algorithm and 2Q (hybrid 
structure) for the non-recursive one. 

Future research may include: 

- Study of alternative choices for the buffer structure, page replacement 
algorithm and buffering scheme in the Self-CPQ and Semi-CPQ [5], which are 
extensions of 1-CPQ and K-CPQ. 

- Consideration of other spatial data structures and multi-dimensional data. 

- Development of a cost model, taking into account the effect of buffering, to 
analyze the number of disk accesses required for K-CPQs for R*-trees (along 
the same lines as in [14], where a cost model for range queries in R-trees has 
been developed). 

References 

1. W. Bridge, A. Joshi, M. Keihl, T. Lahiri, J. Loaiza and N. MacNaughton: “The 
Oracle Universal Server Buffer”, Proc. 23rd VLDB Conf., pp. 590-594, Athens, 
Greece, 1997. 







2. N. Beckmann, H.P. Kriegel, R. Schneider and B. Seeger: “The R*-tree: An Efficient 
and Robust Access Method for Points and Rectangles”, Proc. 1990 ACM SIGMOD 
Conf., pp. 322-331, Atlantic City, NJ, 1990. 

3. T. Brinkhoff, H.P. Kriegel and B. Seeger: “Efficient Processing of Spatial Joins 
Using R-Trees”, Proc. 1993 ACM SIGMOD Conf., pp. 237-246, Washington, DC, 
1993. 

4. H.T. Chou and D. J. DeWitt: “An Evaluation of Buffer Management Strategies for 
Relational Database Systems”, Proc. 11th VLDB Conf., pp. 127-141, Stockholm, 
Sweden, 1985. 

5. A. Corral, Y. Manolopoulos, Y. Theodoridis and M. Vassilakopoulos: “Closest 
Pair Queries in Spatial Databases”, Proc. 2000 ACM SIGMOD Conf., pp. 189-200, 
Dallas, TX, 2000. 

6. C.Y. Chan, B.C. Ooi and H. Lu: “Extensible Buffer Management of Indexes”, Proc. 
18th VLDB Conf., pp. 444-454, Vancouver, Canada, 1992. 

7. A. Corral, M. Vassilakopoulos and Y. Manolopoulos: “Algorithms for Joining 
R-Trees and Linear Region Quadtrees”, Proc. 6th SSD Conf., pp. 251-269, Hong Kong, 
China, 1999. 

8. A. Corral, M. Vassilakopoulos and Y. Manolopoulos: “The Impact of Buffering 
on Closest Pairs Queries using R-trees”, Technical Report, Dept. of Informatics, 
Aristotle University of Thessaloniki, February 2001. 

9. W. Effelsberg and T. Harder: “Principles of Database Buffer Management”, ACM 
Transactions on Database Systems, Vol.9, No. 4, pp. 560-595, 1984. 

10. A. Guttman: “R-trees: A Dynamic Index Structure for Spatial Searching”, Proc. 
1984 ACM SIGMOD Conf., pp. 47-57, Boston, MA, 1984. 

11. Y.W. Huang, N. Jing and E.A. Rundensteiner: “Spatial Joins Using R-trees: 
Breadth-First Traversal with Global Optimizations”, Proc. 23rd VLDB Conf., 
pp. 396-405, Athens, Greece, 1997. 

12. G.R. Hjaltason and H. Samet: “Incremental Distance Join Algorithms for Spatial 
Databases”, Proc. 1998 ACM SIGMOD Conf., pp. 237-248, Seattle, WA, 1998. 

13. T. Johnson and D. Shasha: “2Q: a Low Overhead High Performance Buffer Man- 
agement Replacement Algorithm”, Proc. 20th VLDB Conf., pp. 439-450, Santiago, 
Chile, 1994. 

14. S.T. Leutenegger and M.A. Lopez: “The Effect of Buffering on the Performance of 
R-Trees”, Proc. ICDE Conf., pp. 164-171, Orlando, FL, 1998. 

15. E.J. O’Neil, P.E. O’Neil and G. Weikum: “The LRU-K Page Replacement 
Algorithm for Database Disk Buffering”, Proc. 1993 ACM SIGMOD Conf., pp. 297-306, 
Washington, DC, 1993. 

16. A. Papadopoulos and Y. Manolopoulos: “Global Page Replacement in Spatial 
Databases”, Proc. DEXA’96 Conf., pp. 855-864, Zurich, Switzerland, 1996. 

17. A. Papadopoulos, P. Rigaux and M. Scholl: “A Performance Evaluation of Spatial 
Join Processing Strategies”, Proc. 6th SSD Conf., pp. 286-307, Hong Kong, China, 
1999. 

18. G.M. Sacco: “Index Access with a Finite Buffer”, Proc. 13th VLDB Conf., 
pp. 301-309, Brighton, England, 1987. 

19. M. Stonebraker, J. Frew, K. Gardels and J. Meredith: “The Sequoia 2000 
Benchmark”, Proc. 1993 ACM SIGMOD Conf., pp. 2-11, Washington, DC, 1993. 

20. G.M. Sacco and M. Schkolnick: “A Mechanism for Managing the Buffer Pool in 
a Relational Database System Using the Hot Set Model”, Proc. 8th VLDB Conf., 
pp. 257-262, 1982. 




Enhancing an Extensible Query Optimizer with Support 
for Multiple Equivalence Types 



Giedrius Slivinskas and Christian S. Jensen 

Department of Computer Science, Aalborg University, Denmark 
http://www.cs.auc.dk/~{giedrius|csj} 



Abstract. Database management systems are continuously being extended with 
support for new types of data and advanced querying capabilities. In large part 
because of this, query optimization has remained a very active area of research 
throughout the past two decades. At the same time, current commercial optimizers 
are hard to modify, to incorporate desired changes in, e.g., query algebras or 
transformation rules. This has led to a number of research contributions aiming to 
create extensible query optimizers, such as Starburst, Volcano, and OPT++. 

This paper reports on a study that has enhanced Volcano to support a relational 
algebra with added temporal operators, such as temporal join and aggregation. 
These enhancements include the introduction of algorithms and cost formulas 
for the new operators, six types of query equivalences, and accompanying query 
transformation rules. The paper describes extensions to Volcano's structure and 
algorithms and summarizes implementation experiences. 



1 Introduction 

Query optimization has remained subject to active research for more than twenty years. 
Much research has aimed at enhancing existing optimization technology so that it can 
support the requirements (e.g., new types of data and queries) of the many new 
application areas to which database technology has been introduced 
over the years. However, current commercial optimizers remain hard to extend and 
modify when new operators, algorithms, or transformations have to be added, or when 
cost estimation techniques or search strategies have to be changed [4]. As a result, 
the last decade has witnessed substantial efforts aiming to develop extensible query 
optimizers that would make such changes easier. Representative examples of extensible 
query optimizers include Starburst [8], Volcano [7], and OPT++ [11]. 

This paper reports on a specific study that has enhanced the Volcano extensible query 
optimizer to support a relational algebra with temporal operators such as temporal join 
and aggregation [15]. In addition to new operators, cost formulas, selectivity-estimation 
formulas, and transformation rules, the algebra offers systematic support for order preser- 
vation and duplicate removal and retention for all queries, as well as for coalescing for 
temporal queries (in coalescing, several tuples with adjacent time periods and otherwise 
identical attribute values are merged into one). To support order, relations are defined as 
lists, and six kinds of relation equivalences are defined - two relations can be equivalent 
as lists, multisets, and sets, and two temporal relations can be snapshot-equivalent as lists, 



A. Caplinskas and J. Eder (Eds.): ADBIS 2001, LNCS 2151, pp. 55-69, 2001. 
(c) Springer- Verlag Berlin Heidelberg 2001 







G. Slivinskas and C.S. Jensen 



multisets, and sets. We report on the design decisions and implementation experiences, 
and we evaluate Volcano’s extensibility. 
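
The three non-temporal equivalence types can be illustrated with a small Python 
sketch. It assumes that a relation is represented as a list of hashable tuples; 
the function names are hypothetical, the formal definitions are those of the 
algebra in [15], and the snapshot-equivalence types for temporal relations are 
not covered. 

```python
from collections import Counter


def equivalent_as_lists(r1, r2):
    """Same tuples, same duplicates, same order."""
    return list(r1) == list(r2)


def equivalent_as_multisets(r1, r2):
    """Same tuples with the same number of duplicates; order is ignored."""
    return Counter(r1) == Counter(r2)


def equivalent_as_sets(r1, r2):
    """Same distinct tuples; order and duplicates are ignored."""
    return set(r1) == set(r2)
```

For example, `[("a", 1), ("b", 2), ("b", 2)]` and `[("b", 2), ("a", 1), ("b", 2)]` 
are equivalent as multisets and as sets but not as lists, which is why an 
optimizer that tracks the required equivalence type can apply more 
transformations when order or duplicates do not matter to the query. 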

An important goal of the algebra is to offer a foundation for a layered temporal 
DBMS that may evaluate temporal queries faster than do current DBMSs. The latter 
do not have efficient algorithms for expensive temporal operations such as temporal 
aggregation, while such operations can be evaluated efficiently at the user-application 
level by algorithms that use cursors to access the underlying data [16]. 

New algorithms can be added to a DBMS via, e.g., user-defined routines in 
Informix [2,9] or PL/SQL procedures in Oracle [13], but these methods currently do not 
allow defining functions that take tables as arguments and return tables [10]; nor do they 
allow specifying transformation rules, cost formulas, and selectivity-estimation formulas 
for the new functions. Because of these limitations, a middleware component with query 
processing capabilities was introduced, which divides the query processing between 
itself and the underlying DBMS [16]. Intermediate relations can be moved between the 
middleware and the DBMS with the help of transfer operators. 

To adequately divide the processing, the middleware has to take optimization 
decisions - for this purpose, we employ the Volcano extensible optimizer. Use of a 
separate middleware optimizer allows us to take advantage of transformation rules and 
cost and selectivity-estimation formulas specific to the temporal operators. 

This paper summarizes design issues and experiences from the implementation of 
the optimizer. While the addition of new temporal operators, their cost and selectivity- 
estimation formulas, and transformation rules could be done using the extensibility 
framework provided by Volcano, adding support for multiple types of equivalences be- 
tween relations required changes in Volcano structures, and in its search-space generation 
and plan-search algorithms. 

To our knowledge, no existing extensible query optimizers systematically support 
sets, multisets, and lists. Sorting is treated differently from the common operators, such
as selection or join, and it is usually considered in query optimization only after
the search space of possible query plans has been generated. However, particularly
due to the recent introduction and increasing use of TOP N and BOTTOM N predicates in
queries [3], sorting could be exploited better in query optimization if it were considered
during the search-space generation.

The paper is structured as follows. In Sect. 2, we present Volcano’s architecture. 
Section 3 describes the enhancements to Volcano that were necessary to support the 
algebra introduced above. The algebraic framework is described first, with a focus on 
the parts that posed challenges to Volcano, and the modifications are described next. 
Section 4 summarizes the implementation experiences and evaluates the extensibility of 
Volcano. Section 5 covers related work, and Sect. 6 concludes the paper. 



2 Description of Volcano 

The Volcano Optimizer Generator is a software program for generating extensible query 
optimizers. The input to the program is a query algebra: operators, their implementation 
algorithms (physical operators), transformation rules, and implementation rules. Trans- 
formation rules specify equivalent logical expressions, and implementation rules specify 




Enhancing an Extensible Query Optimizer with Support 






which algorithms implement which operators. The output is an optimizer, which takes 
a query in the given algebra as input and returns a physical expression (an expression of 
algorithms) representing the chosen query evaluation plan. The optimizer implementor’s 
tasks include the specification of the input and the coding of the support functions - such 
as the selectivity estimation - for operators and rules. 



2.1 Two Stages of Query Optimization 

The Volcano optimizer optimizes queries in two stages. First, the optimizer generates the
entire search space consisting of logical expressions generated using the initial query plan
(to which the query is mapped) and the set of transformation rules. The search space is
represented by a number of equivalence classes. An equivalence class may contain one 
or more logically equivalent expressions, also called elements; each of these includes 
an operator, its parameter (for example, predicates for the selection), and pointers to its 
inputs (which are also equivalence classes). 

Consider a simple example query that performs a join on the EmpID attribute of the
POSITION and SALARY relations. One possible initial plan is shown in Fig. 1(a), and its
search space is shown in Fig. 1(b). The elements of classes 1 and 2 represent logical
expressions returning partial results of the query, i.e., the operators retrieving,
respectively, the POSITION and SALARY relations. The elements of class 3 represent
logical expressions returning the result of the complete query; either the first or the
second element may be used. Essentially, the given search space represents only two
plans, which differ in the order of the join arguments.
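
The search space of Fig. 1(b) can be written down as a small data structure. The following is only a minimal sketch: the names `Element`, `EquivalenceClass`, and `search_space` are hypothetical and are not taken from the Volcano code, and the operator names are plain strings chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Element:
    """One logical expression: an operator, its parameter, and its inputs."""
    operator: str                 # e.g. "get" or "join"
    parameter: Optional[str]      # e.g. a relation name or a join attribute
    inputs: Tuple[int, ...] = ()  # ids of the input equivalence classes


@dataclass
class EquivalenceClass:
    """A collection of logically equivalent elements."""
    class_id: int
    elements: List[Element] = field(default_factory=list)


# Search space of Fig. 1(b): two base relations and their join.
search_space: Dict[int, EquivalenceClass] = {
    1: EquivalenceClass(1, [Element("get", "POSITION")]),
    2: EquivalenceClass(2, [Element("get", "SALARY")]),
    3: EquivalenceClass(3, [
        Element("join", "EmpID", (1, 2)),
        Element("join", "EmpID", (2, 1)),  # same join, arguments switched
    ]),
}

for cls in search_space.values():
    print(cls.class_id, [(e.operator, e.parameter, e.inputs) for e in cls.elements])
```

Note that the inputs of an element are classes rather than elements, which is what lets one join element stand for every combination of plans for its arguments.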

During the second stage of Volcano’s opti- 
mization process, the search for the best plan is 
performed. Here, the implementation rules are 
used to replace operators by algorithms, and the 
costs of diverse subplans are estimated. For the 
given query, the number of plans to be considered 
is greater than two, because the relations may be 
retrieved by using either full scan or index scan, 
and the join may be implemented by, e.g., nested- 
loop, sort-merge, or index join. One possible evaluation plan is to scan both relations 
and perform a nested-loop join. 

The following two sections briefly describe the search-space generation and the
plan-search algorithms; for more detail, we refer to [7].



[Figure: (a) initial plan, a join on EmpID over get POSITION and get SALARY; (b) the corresponding search space]

Fig. 1. A Query Plan and its Search Space










2.2 Stage One: Search-Space Generation 

The search-space generation is performed by the Generate function. Initially, one ele- 
ment is created for each operator in the original query expression, and then Generate 
is invoked on the top element. 

The Generate function repeatedly applies transformation rules to the given element, 
choosing among the applicable rules that have not so far been applied to the element. 
The application of a transformation rule may trigger the creation of new elements and 
classes; for each newly generated element, the Generate function is invoked. 

For the query in Fig. 1(a), the search space is generated as follows. Initially, three 
elements representing the three query-tree operators are created (the first elements of 
equivalence classes 1-3 in Fig. 1(b)). Then, the Generate function is invoked for the 
first element of class 3, which, in turn, invokes Generate for the first elements of classes 
1 and 2. The latter two Generate calls do not do anything because no rules apply to the 
elements of class 1 and 2. For the first element of class 3, however, the join commutativity 
rule is applied, and a second element pointing to switched join arguments is added to 
class 3. Then, the Generate function is invoked on the new element of class 3, but no 
new elements are generated: the join commutativity rule is applied again, but its resulting 
right-hand element already exists in the search space. 
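
The walk-through above can be traced with a short sketch. This is not Volcano's code: the `generate` function below is a simplified stand-in that assumes every rule places its result in the class of the element it was applied to (true for join commutativity, but not for rules that create new classes), and all names are hypothetical.

```python
from typing import Dict, List, Tuple

# An element is (operator, parameter, input class ids); a class is a list of elements.
Element = Tuple[str, str, Tuple[int, ...]]
SearchSpace = Dict[int, List[Element]]


def join_commutativity(element: Element) -> List[Element]:
    """Return the element obtained by switching the arguments of a join."""
    operator, parameter, inputs = element
    if operator == "join" and len(inputs) == 2:
        return [(operator, parameter, (inputs[1], inputs[0]))]
    return []


RULES = [join_commutativity]


def generate(space: SearchSpace, class_id: int, element: Element) -> None:
    """Expand the inputs of `element`, then apply every rule to it."""
    for input_class in element[2]:
        for child in list(space[input_class]):
            generate(space, input_class, child)
    for rule in RULES:
        for new_element in rule(element):
            if new_element not in space[class_id]:  # skip results that already exist
                space[class_id].append(new_element)
                generate(space, class_id, new_element)


space: SearchSpace = {
    1: [("get", "POSITION", ())],
    2: [("get", "SALARY", ())],
    3: [("join", "EmpID", (1, 2))],
}
generate(space, 3, space[3][0])
print(space[3])
```

The recursion stops for the same reason as in the text: applying commutativity to the new element of class 3 reproduces an element that is already present.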



2.3 Stage Two: Plan Search 

When searching for a plan, the Volcano optimizer employs dynamic programming in a 
top-down manner, and it uses the FindBestPlan function recursively. 

First, the optimizer invokes the FindBestPlan function for the first element of the
top equivalence class (e.g., class 3 in Fig. 1(b)) and a cost limit of infinity (the cost limit
can be lower in subsequent calls to the function). If all elements of the class containing the
argument element have already been optimized, no further optimization for the element
is necessary: if a plan has been found and its cost is lower than the cost limit, it is
returned; if not, NULL is returned. Otherwise, optimization has to be performed.

During the optimization, for each algorithm implementing the top operator (in our 
case, join), FindBestPlan is recursively invoked on the inputs to the algorithm. If 
optimization of the inputs is successful, the plan with the algorithm yielding the cheapest 
expected cost is chosen as the best plan. Then, FindBestPlan is recursively invoked for 
each equivalent logical expression (in our case, for the second element in equivalence 
class 3) to see if a better plan can be found. In case a better plan is found, it is saved in 
memory as the best one. 
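
A much simplified sketch of this top-down, memoizing search is given below. It is an illustration under stated assumptions rather than Volcano's algorithm: the algorithm names and local costs are made up, the memo stores the cheapest plan per class, and the cost limit is only checked when a plan is returned (it is not used for pruning, as it is in Volcano).

```python
import math
from typing import Dict, List, Optional, Tuple

Element = Tuple[str, str, Tuple[int, ...]]  # (operator, parameter, input class ids)
Plan = Tuple[float, str]                    # (estimated cost, plan description)

SPACE: Dict[int, List[Element]] = {
    1: [("get", "POSITION", ())],
    2: [("get", "SALARY", ())],
    3: [("join", "EmpID", (1, 2)), ("join", "EmpID", (2, 1))],
}

# Hypothetical algorithms and made-up local costs, for illustration only.
ALGORITHMS: Dict[str, List[Tuple[str, float]]] = {
    "get": [("full_scan", 10.0), ("index_scan", 4.0)],
    "join": [("nested_loop", 50.0), ("sort_merge", 20.0)],
}

best: Dict[int, Optional[Plan]] = {}  # memo: cheapest plan found per class


def optimize_class(class_id: int) -> Optional[Plan]:
    """Try every algorithm for every equivalent element of the class."""
    cheapest: Optional[Plan] = None
    for operator, parameter, inputs in SPACE[class_id]:
        for algorithm, local_cost in ALGORITHMS[operator]:
            input_plans = [find_best_plan(i) for i in inputs]
            if any(plan is None for plan in input_plans):
                continue  # an input could not be optimized
            cost = local_cost + sum(plan[0] for plan in input_plans)
            if cheapest is None or cost < cheapest[0]:
                children = ", ".join(plan[1] for plan in input_plans)
                cheapest = (cost, f"{algorithm}[{parameter}]({children})")
    return cheapest


def find_best_plan(class_id: int, cost_limit: float = math.inf) -> Optional[Plan]:
    """Return the memoized best plan if it is cheaper than the limit, else None."""
    if class_id not in best:
        best[class_id] = optimize_class(class_id)
    plan = best[class_id]
    return plan if plan is not None and plan[0] < cost_limit else None


print(find_best_plan(3))
```

With these made-up costs, both join elements lead to plans of equal cost, so the plan found for the first element is kept.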



3 Enhancement of Volcano 

The implementation of the algebra and its accompanying transformation rules intro- 
duces several concepts that did not exist previously in Volcano; these new concepts are 
described in Sect. 3.1. Sections 3.2 and 3.3 concern the actual implementation. 







3.1 Algebra and Multi-equivalence Transformation Rules 

First, we overview the architecture for which the algebra has been designed. Next, we 
describe the actual algebra, the accompanying transformation rules, and their applica- 
bility, focusing on the new concepts. Finally, we outline the challenges that these new 
concepts pose to Volcano. 

Architecture. The temporally extended relational algebra [15] has been designed for an 
architecture consisting of a middleware component and an underlying DBMS. Expensive 
temporal operations such as temporal aggregation do not have efficient algorithms in 
the DBMS, but can be evaluated efficiently by the middleware, which uses a cursor to 
access DBMS relations [16]. Consequently, query processing is divided between the 
middleware and the DBMS; the main processing medium is still the DBMS, but the 
middleware is used when this can yield better performance. 

Algebra. The algebra differs from the conventional relational algebra in several aspects.
First, it includes temporal operators such as temporal join and temporal aggregation.
Next, it contains two transfer operators that allow the query processing to be partitioned
between the middleware and the DBMS. Finally, the algebra provides a consistent handling
of duplicates and order at the logical level, by treating duplicate elimination and sorting
like other logical operators and by introducing six types of relation equivalences.

Two relations are equivalent (1) as lists if they are identical lists ($\equiv_L$); (2) as
multisets if they are identical multisets taking into account duplicates, but not order
($\equiv_M$); and (3) as sets if they are identical sets, ignoring duplicates and order
($\equiv_S$). Two temporal relations are snapshot-list ($\equiv^S_L$), snapshot-multiset
($\equiv^S_M$), or snapshot-set equivalent ($\equiv^S_S$), if their snapshots (projections at
a given point in time) are equivalent as lists, multisets, or sets.
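
The equivalence types can be made concrete with a small sketch. It rests on several assumptions that are not spelled out in the definitions above: a relation is a Python list of tuples, the last two attributes of a temporal tuple are a closed-open period, and snapshot equivalence is checked over a finite range of time points. All function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Relation = List[Tuple]


def equivalent_as_lists(r1: Relation, r2: Relation) -> bool:
    return r1 == r2


def equivalent_as_multisets(r1: Relation, r2: Relation) -> bool:
    return Counter(r1) == Counter(r2)   # duplicates count, order does not


def equivalent_as_sets(r1: Relation, r2: Relation) -> bool:
    return set(r1) == set(r2)           # neither duplicates nor order count


def snapshot(relation: Relation, t: int) -> Relation:
    """Tuples valid at time t (period [T1, T2)), without the time attributes."""
    return [row[:-2] for row in relation if row[-2] <= t < row[-1]]


def snapshot_multiset_equivalent(r1: Relation, r2: Relation,
                                 times: Iterable[int]) -> bool:
    return all(equivalent_as_multisets(snapshot(r1, t), snapshot(r2, t))
               for t in times)


a = [("Tom",), ("Jane",), ("Tom",)]
b = [("Jane",), ("Tom",), ("Tom",)]
print(equivalent_as_lists(a, b), equivalent_as_multisets(a, b))

# One long period versus the same period split in two.
r1 = [("Tom", 2, 8)]
r2 = [("Tom", 2, 5), ("Tom", 5, 8)]
print(equivalent_as_multisets(r1, r2),
      snapshot_multiset_equivalent(r1, r2, range(1, 13)))
```

The second pair illustrates why the snapshot equivalences are weaker: the two relations differ as multisets, yet they agree at every point in time.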

Figure 2 shows two temporal relations (relations having two attributes indicating a 
time period), POSITION and SALARY. We assume a closed-open representation for time 
periods and assume the time values for T1 and T2 denote months during some year. For
example, Tom occupied position Pos1 from February to August (not including
the latter).



POSITION

PosID  EmpID  Name  T1  T2
Pos1   1      Tom   2   8
Pos2   2      Jane  3   8

SALARY

EmpID  Amount  T1  T2
1      100K    2   6
1      120K    6   9
2      110K    3   8

Result

EmpID  Name  PosID  Amount  T1  T2
1      Tom   Pos1   100K    2   6
1      Tom   Pos1   120K    6   8
2      Jane  Pos2   110K    3   8

Fig. 2. Relations POSITION and SALARY, and the Result of Temporal Join



A temporal join is a regular join, but with a selection on the time attributes, ensuring 
that the joined tuples have overlapping time periods; Figure 2 shows the result of temporal 
join on the EmpID attribute of the POSITION and SALARY relations. 
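
As an illustration, the temporal join of Fig. 2 can be reproduced with a nested-loop sketch. The function name `temporal_join` and the tuple layout are assumptions made for this example; the result period is taken to be the intersection of the two input periods, as the Result relation in Fig. 2 suggests.

```python
from typing import List, Tuple

# (PosID, EmpID, Name, T1, T2) and (EmpID, Amount, T1, T2), periods closed-open.
POSITION = [("Pos1", 1, "Tom", 2, 8), ("Pos2", 2, "Jane", 3, 8)]
SALARY = [(1, "100K", 2, 6), (1, "120K", 6, 9), (2, "110K", 3, 8)]


def temporal_join(position: List[Tuple], salary: List[Tuple]) -> List[Tuple]:
    """Join on EmpID, keeping only pairs whose periods overlap."""
    result = []
    for pos_id, emp_id, name, p1, p2 in position:
        for s_emp_id, amount, s1, s2 in salary:
            t1, t2 = max(p1, s1), min(p2, s2)  # intersection of the two periods
            if emp_id == s_emp_id and t1 < t2:  # non-empty intersection
                result.append((emp_id, name, pos_id, amount, t1, t2))
    return result


for row in temporal_join(POSITION, SALARY):
    print(row)
```

Because the outer loop runs over the left argument, this nested-loop variant also retains the order of its left argument.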







Transformation Rules. Six types of equivalences lead to six types of transformation rules,
since a transformation rule may satisfy several of the six equivalences. Let us consider
two rules for temporal join, $\bowtie^T$. For a given rule, we always specify the strongest
equivalence type that holds; the ordering of equivalence types is given in Fig. 3. The
join commutativity rule $r_1 \bowtie^T r_2 \rightarrow_M r_2 \bowtie^T r_1$ says that the
relations resulting from the left-hand and right-hand sides are equivalent as multisets (and,
according to the type ordering, also as sets, and that their snapshots are equivalent as
multisets and as sets). Meanwhile, the sort push-down rule
$\mathit{sort}_A(r_1 \bowtie^T r_2) \rightarrow_L \mathit{sort}_A(r_1) \bowtie^T r_2$, where A
belongs to the attribute schema of $r_1$ and the left-hand side operations are located in
the middleware, says that the relations are equivalent as lists and that the other five
equivalence types also hold (see footnote 1). The latter rule exploits the fact that all
temporal join algorithms in the middleware retain the sorting of their left arguments.

Applicability of Transformation Rules. Transformation rules that do not guarantee $\equiv_L$
equivalence cannot always be applied, as illustrated by the following example. Consider
a query that performs the above-mentioned temporal join and sorts the result by Name. 
One possible initial plan for this query is shown in Fig. 4(a). The bottom operators 
represent relations POSITION and SALARY transferred to the middleware; to achieve this, 
at least two operations are necessary (a table scan in the DBMS and the actual transfer), 
but to simplify the example, we view them as one operation and do not consider any 
transformation rules related to these operations. Temporal join and sorting are performed 
in the middleware. 

Let us consider rule $r_1 \bowtie^T r_2 \rightarrow_M r_2 \bowtie^T r_1$. This rule can
be applied to switch the arguments of the join. However, if we apply the sort push-down
rule first and move the sorting below the temporal join, before the temporal join’s left
argument (leading to the plan shown in Fig. 4(b)), the application of the join commutativity
rule would lead to an incorrectly ordered query result. Thus, to be able to tell when an
$\rightarrow_M$ rule is applicable, the optimizer needs to know the importance of order at
each node in the query tree, i.e., whether the result of the operation at the node has to
preserve some order or not. In the algebra, this importance is determined by the
OrderRequired property. To determine the applicability of rules of other types, two
additional properties, DuplicatesRelevant and PeriodPreserving, are used; the first is True
if the operation at the node cannot arbitrarily add or remove duplicates, and the second is
True if the operation at the node cannot replace its result with a snapshot-equivalent one.
For each rule of a given type, Table 1 shows the applicability condition for operator nodes
on the left-hand side of the rule.

Having an initial query plan, the properties for operators are set in a top-down 
manner and then adjusted every time a new transformation rule is applied. For the top 
operator, the properties are set in accordance with the specific user-level query language 
and query statement, e.g., an SQL query requires the result to be sorted if the ORDER 
BY clause is specified at the outer-most level. Consequently, for the top element, the 
OrderRequired property is set to True only if the ORDER BY clause is specified at 
the outer-most level. The DuplicatesRelevant and PeriodPreserving properties are
always set to True, because we always care about duplicates and time periods. For the

1 To be precise, the relations are $\equiv_{L,A}$ equivalent, i.e., their projections on A are
$\equiv_L$ equivalent. We will use $\equiv_L$ equivalence for simplicity.



[Figure: ordering of the six equivalence types, with L at the top and SS at the bottom]

Fig. 3. Ordering of Equivalence Types







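[Figure: (a) initial plan, a sort on Name over a temporal join of get POSITION and get SALARY; (b) the plan after sort push-down, with the sort placed below the temporal join, above get POSITION]

Fig. 4. Query Plans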



other operators, the properties are set according to the property values of their parents, 
e.g., if some operator is the input to the sort operator, its OrderRequired property will be 
set to False, because its resulting relation may be replaced (via some transformation rule) 
by a multiset-equivalent relation, and the correct order of the result will still be ensured 
by the following sort operator. For more details about setting the property values, we
refer to [15].

Support in Volcano. Volcano provides a framework for adding new operators and
transformation rules, which allows a rather straightforward addition of temporal operators
and transfer operators, their cost formulas, selectivity-estimation formulas, and schema
propagation formulas. The difficult part is to incorporate different types of transformation
rules. While different rule types can be added by just introducing an extra type
attribute to each rule, controlling their applicability is more difficult. The property
mechanism cannot be used directly because of Volcano’s search-space structure. In a
Volcano search space, the values of the three properties cannot be determined for an
element, because the property values of the elements above it are unknown: the same
equivalence-class element may be used as input by different elements of different
equivalence classes. This is shown later in Fig. 5, where the first element of equivalence
class 2 is used both by two elements of equivalence class 3 and by two elements of class
4. Therefore, the determination of the properties can only occur during the actual search,
which is performed top down.



Table 1. Applicability of a Rule According to its Type

Rule type          Applicability condition, $\forall op \in lhs$
$\rightarrow_L$    True
$\rightarrow_M$    $\neg OrderRequired(op)$
$\rightarrow_S$    $\neg DuplicatesRelevant(op) \wedge \neg OrderRequired(op)$
$\rightarrow^S_L$  $\neg PeriodPreserving(op)$
$\rightarrow^S_M$  $\neg OrderRequired(op) \wedge \neg PeriodPreserving(op)$
$\rightarrow^S_S$  $\neg DuplicatesRelevant(op) \wedge \neg OrderRequired(op) \wedge \neg PeriodPreserving(op)$
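
The conditions of Table 1 can be encoded as a lookup from rule type to the properties that must be False. The sketch below is one possible encoding; the names `Properties`, `MUST_BE_FALSE`, and `rule_applicable` are hypothetical and do not come from the implementation described here.

```python
from typing import Dict, List, NamedTuple, Set


class Properties(NamedTuple):
    """The three properties of one operator node."""
    order_required: bool
    duplicates_relevant: bool
    period_preserving: bool


# For each rule type, the properties that must be False at every operator
# on the left-hand side of the rule (one row of Table 1 per entry).
MUST_BE_FALSE: Dict[str, Set[str]] = {
    "L": set(),
    "M": {"order_required"},
    "S": {"duplicates_relevant", "order_required"},
    "SL": {"period_preserving"},
    "SM": {"order_required", "period_preserving"},
    "SS": {"duplicates_relevant", "order_required", "period_preserving"},
}


def rule_applicable(rule_type: str, lhs: List[Properties]) -> bool:
    """Check the Table 1 condition for all operators on the left-hand side."""
    return all(not getattr(op, name)
               for op in lhs
               for name in MUST_BE_FALSE[rule_type])


top_join = Properties(order_required=True, duplicates_relevant=True,
                      period_preserving=True)
join_below_sort = top_join._replace(order_required=False)

print(rule_applicable("M", [top_join]))         # commutativity at an ordered node
print(rule_applicable("M", [join_below_sort]))  # commutativity below a sort
```

The two calls mirror the example of Fig. 4: join commutativity is rejected where the order of the result matters and accepted where a later sort restores it.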






3.2 Adjustment of the Search-Space Generation 



Since it is impossible to determine properties during the search-space generation, we 
generate a complete search space by applying transformation rules of all types, and 
then filter away invalid elements during the actual search. The identification of invalid 
elements is enabled by recording, for each element, a type that represents the combi- 
nations of the three property values for which this element may be used. We use six 
possible type values - L, M, S, SL, SM, SS - which correspond to the six equivalence 
types. Consequently, the relationship between each element type and the combination 
of properties corresponds to Table 1. For example, if all properties are True, only L type 
elements are valid. Intuitively, the element type tells how the relation generated by this 
element will be equivalent to the first element of the equivalence class. 

Figure 5 shows the search space for the query in Fig. 4(a), generated using the
join commutativity rule and the sort push-down rule (the first one guarantees $\equiv_M$
equivalence, while the second one guarantees $\equiv_L$ equivalence). Initially, four elements
representing the four query-tree operators are created (the first elements of equivalence 
classes 1-4). Then, the join commutativity rule is applied to the first element of class 3, 
and a second element representing switched join arguments is added to the class. The 
sort push-down rule is applied to the first element of class 4, and two new elements are 
created, one of which is added to class 4 and one of which becomes the only element of 
class 5. Finally, the join commutativity rule is applied to the second element of class 4, 
yielding the third element in the class. 




Fig. 5. Search Space 














The first elements of classes 1-3 have equivalence type M only, because the base 
relations are retrieved from the DBMS, and we do not know in which order the DBMS 
will deliver them. It may happen that a subquery whose top element is the first element 
of class 3, when run twice, would return relations that are only multiset equivalent. 

The third element of class 4 is only $\equiv_M$ equivalent to the other two elements of that
class. Since the query requires a sorted result (the OrderRequired property value for the
top operator is True), only the first two elements of class 4 will be used during plan
search. Below, we discuss how the element types are determined.

During the search-space generation, new elements are added after applying
transformation rules. For a transformation rule, the following procedure sets the
types of the elements resulting from the right-hand side of the rule:

1. The top-element type (the element representing the top operator in the right-hand 
side of the rule) is set to the type which is the greatest common descendant of the 
transformation-rule type and the types of the elements participating in the left-hand 
side of the transformation rule. 

2. The top-element type is set to a stronger type than specified in 1 only if the right- 
hand side contains an operation - such as sorting or duplicate elimination - that 
would enforce a “stronger” equivalence between the new top element and the old 
top element. 

3. The types of other new elements resulting from the right-hand side of the rule are 
set to any value, but they have to be equal to or stronger than the top-element type. 

For example, the greatest common descendant of types M and SM is SS. Let us 
consider the search space in Fig. 5: the join commutativity rule applied to the second 
element of class 4 results in the third element of class 4, and its type is set to M, which is 
the greatest common descendant of L (the type of the second element) and M (the type 
of the rule). 

Now let us consider another query, which performs a selection on relation r
transferred to the middleware and then sorts it; see its search space in Fig. 6(a). After
transformation rule $\mathit{sort}_A(\sigma_P(r)) \rightarrow_L \sigma_P(\mathit{sort}_A(r))$ is
applied to the first element of class 3, the new top element - which becomes the second
element of class 3 - is of type L (Fig. 6(b)). Even if the sorting is not at the top level, the
result is correctly ordered because the selection retains the order of its argument.

In the given examples, the types of the new non-top elements are set to L. Generally,
the types of non-top elements are not important for correctness, as long as they are not
descendants of the new top-element type (see the equivalence-type ordering in Fig. 3).
Therefore, they should be set with the aim of keeping the search space as small as
possible: if an element has to be inserted, we can first check whether the same element
(with any type) already exists in the search space, and if it does, we do not need to insert
it anew. If no such element exists, a new element should have type L, because most rules
are of $\rightarrow_L$ type, and it is likely that, if an attempt is later made to insert this
element as a top-level element, its type will be L.








Fig. 6. Search Space Before (a) and After (b) Applying $\mathit{sort}_A(\sigma_P(r)) \rightarrow_L \sigma_P(\mathit{sort}_A(r))$



3.3 Modification of the Plan Search 

For the actual search, the code that controls the validity of elements depending on their 
type has to be added to Volcano. 

The most significant change is the addition of properties to the parameter list of the 
FindBestPlan function. The function uses its input properties to check the validity of 
its input element, as mentioned in Sect. 3.2, as well as to set the parameter properties 
for calling itself recursively on the inputs to its input element. 

Since equivalence-class elements might be of any of the six different types, each 
equivalence class may have up to six physical plans, because plans for different-type 
elements might differ. For example, it is likely that a type M plan will be simpler and 
less costly than a type L plan. In the FindBestPlan function, when checking whether a plan
already exists for the input element, we have to look for a plan of a type that is stronger
than or equal to the input-element type.
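
The validity check and the passing of properties to the inputs can be sketched as follows. This is an illustration only: the property handling covers just the sort case described in Sect. 3.1, the element types of class 4 are those discussed for Fig. 5, and the names `RELAXED`, `is_valid`, and `input_properties` are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

# For each element type, the properties that must be False for the element
# to be usable (read off Table 1).
RELAXED: Dict[str, FrozenSet[str]] = {
    "L": frozenset(),
    "M": frozenset({"OrderRequired"}),
    "S": frozenset({"OrderRequired", "DuplicatesRelevant"}),
    "SL": frozenset({"PeriodPreserving"}),
    "SM": frozenset({"OrderRequired", "PeriodPreserving"}),
    "SS": frozenset({"OrderRequired", "DuplicatesRelevant", "PeriodPreserving"}),
}


def is_valid(element_type: str, properties: Dict[str, bool]) -> bool:
    """An element is valid if every property its type relaxes is False."""
    return not any(properties[name] for name in RELAXED[element_type])


def input_properties(operator: str, properties: Dict[str, bool]) -> Dict[str, bool]:
    """Properties handed to the inputs of an element (only sort is modelled)."""
    result = dict(properties)
    if operator == "sort":
        result["OrderRequired"] = False  # the sort re-establishes the order
    return result


# Class 4 of Fig. 5 as (operator, element type) pairs.
CLASS_4: List[Tuple[str, str]] = [
    ("sort", "L"), ("temporal_join", "L"), ("temporal_join", "M"),
]
top = {"OrderRequired": True, "DuplicatesRelevant": True, "PeriodPreserving": True}

usable = [element for element in CLASS_4 if is_valid(element[1], top)]
below_sort = input_properties("sort", top)
print(usable)
print(is_valid("M", below_sort))
```

With all three properties True at the top, only the two type L elements of class 4 pass the check, while the type M elements of class 3 become usable once the search descends below the sort.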



4 Experiences 

In this section, we consider the extensibility of Volcano in relation to the needs of our 
framework. We evaluate its support for multiple types of equivalence, discuss other 
extensions, and evaluate the ease of extensibility. 



4.1 Support for Multiple Types of Equivalences 

When considering multiple types of equivalences, sorting, duplicate elimination, and 
coalescing are important operations, because they may change the equivalence type 















between two relations. For example, if two $\equiv_M$ equivalent relations are sorted on A,
their sorted versions will be $\equiv_{L,A}$ equivalent.
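
A small sketch illustrates this strengthening. The relations and attribute positions are made up for the example, and Python's stable `sorted` stands in for a sort operator.

```python
from collections import Counter

# Two relations with attributes (A, B) that are equal as multisets only.
r1 = [("Tom", 1), ("Jane", 2), ("Tom", 3)]
r2 = [("Jane", 2), ("Tom", 3), ("Tom", 1)]

same_multiset = Counter(r1) == Counter(r2)
same_list = r1 == r2

# Sort both relations on A (the first attribute).
s1 = sorted(r1, key=lambda row: row[0])
s2 = sorted(r2, key=lambda row: row[0])

# Projections on A are identical lists, although the full lists may differ
# in the relative order of tuples that agree on A.
projection_equal = [row[0] for row in s1] == [row[0] for row in s2]
full_list_equal = s1 == s2

print(same_multiset, same_list, projection_equal, full_list_equal)
```

The last comparison shows why the sorted relations are only guaranteed to be list equivalent with respect to A: the two Tom tuples may appear in either order.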

Coalescing and duplicate elimination were not implemented in Volcano, and sorting 
is supported by the so-called physical properties of an equivalence class. The possible 
use of sorting algorithms (termed enforcers) is considered during the second phase
(plan search) of query optimization. Physical properties are passed as arguments to the
FindBestPlan function, and they allow the optimizer to consider different positions of
sort enforcers. The use of physical properties increases the code complexity and size:
for each algorithm implementing an operator, the optimizer implementor has to write
functions deriving physical properties of the algorithm’s inputs, checking whether the
algorithm satisfies required physical properties, and finding physical properties that are
required from the algorithm’s inputs.

In our approach, we treat sorting, duplicate elimination, and coalescing like all
other operators and exploit them in the search-space generation, without using physical
properties. While it may be possible to pursue a direction where sorting, duplicate
elimination, and coalescing are all treated as enforcers that employ physical properties, we
feel that this treatment would add unnecessary complexity to the framework because,
fundamentally, sorting, coalescing, and duplicate elimination are just like other operators,
having their transformation rules and statistics-derivation formulas. Treating them
as algorithms reduces the number of transformation rules, but greatly increases the
complexity of the plan-search algorithm. In addition, it would be problematic to incorporate
the statistics-estimation formulas for duplicate elimination and coalescing.



4.2 Other Useful Extensions 

Our implementation has indicated the need for new or better support in a number of 
other areas. 

The two-stage query optimization of Volcano forced us to apply all types of trans- 
formation rules during the first stage. If one stage with a top-down plan search and 
generation had been used, it would have been easier to control the applicability of the 
different types of rules and, possibly, would have improved performance. 

The search strategy of Volcano is fixed, and no mechanisms for extending or changing 
it are provided. Proposed improvements of Volcano that were not part of the available 
code include a mechanism for heuristic guidance, where rules can be ordered according 
to their “promise” [7]. Such ordering implies that the rules having the best probability to 
yield better plans would be applied as soon as possible, reducing the overall plan-search 
time. 

We had to add support for equivalence-class elements that point to their own
equivalence classes, because this facility was not available in the code supplied. An
element often points to its own class when different equivalence types are involved. For
example, sorted relation r is multiset equivalent to r, yielding a class with two elements
(one for r sorted and one for r) where the first (sorting) points to the same class. In
addition, we had to implement the linking of classes; the linking is needed when we apply
a rule to an element of a certain class and find that the resulting element already exists in
some other class, meaning that both classes represent the same logical expression.







The cardinality of a relation resulting from some equivalence class is estimated 
when the class is created, according to the selectivity estimation method of the operator 
represented in the first element. When a new element is added to the class, the cardinality 
is not reestimated. However, the new element may represent an operator for which we 
may have a better method for estimating the cardinality. For example, it is easier to 
estimate the size of a join, than the size of a Cartesian product followed by a selection 
and a projection. Therefore, we had to ensure that the initial plan would contain operators 
with good cardinality estimation methods. 

4.3 Ease of Extensibility 

The main challenge for an extensible query optimizer is to balance efficiency and
extensibility, and our study indicates that Volcano’s main emphasis is on the first aspect.
Volcano is coded in C and does not follow the object-oriented paradigm, which leads to
many interconnected structures; this made it difficult to figure out where the structures
were defined, initialized, and used. The transformation-rule application code is generated
automatically and does not follow any style guidelines, making it difficult to modify
(which was needed when incorporating the necessary modifications in the search-space
generation). Many arrays and structures have predefined sizes and are not allocated
dynamically, occupying more memory than necessary and providing low scalability. On
the other hand, the running times of Volcano (for queries not involving many joins) were
quite low, as shown in [16].

The actual implementation tasks, their difficulty, and approximate number of lines of 
resulting code are summarized in Table 2. We divide the entire implementation effort into 
three subtasks. The first one, adding support for multiple equivalence types, is the most 
difficult, and it has been described in the previous sections. Yet the amount of resulting
code was rather small. The second task was to add new operators, and while it resulted
in a substantial amount of code, it was not difficult once we had learned Volcano’s
framework for adding new operators and transformation rules. The same applies to the
last task of adding new algorithms; there, however, the amount of code was smaller,
because we did not use physical properties.

Support functions form the biggest part of the code added by the optimizer im- 
plementor and their size is proportional to the number of operators and algorithms 
implemented. In our case, we implemented relation retrieval, selection, projection, join, 
sort-preserving join, temporal join, Cartesian product, duplicate elimination, aggrega- 
tion, temporal aggregation, and two transfer operators. Similar behavior of many of 
these operators (particularly, in the propagation of catalog information) resulted in a lot 
of code repetition in corresponding support functions. 

5 Related Work 

Our paper takes its outset in the algebraic framework presented in [15]. The framework 
has been validated by implementing it using the Volcano optimizer and the XXL library of 
query evaluation algorithms; the architecture, cost and selectivity-estimation formulas, 
and performance studies have been reported in [16]. The latter paper did not cover the
enhancements to Volcano, which are the focus of this paper.







Table 2. Tasks, Their Complexity, and Amount of Code

Task                                           Complexity   Lines of Code
Adding equivalence-type support
  Modifying structures                         medium       < 200
  Modifying search-space generation            high         < 200
  Modifying plan search                        high         < 200
Adding new operators
  Coding support functions                     medium       ~ 2500
  Coding management of the three properties    medium       ~ 400
  Coding transformation rules                  medium       ~ 2300
Adding new algorithms
  Coding support functions                     low          ~ 1300
  Coding implementation rules                  medium       < 200



While to our knowledge, nobody has enhanced existing optimizers with support for 
sets, multisets, and lists, reference [1] reports on experiences from building the query 
optimizer for Texas Instruments’ Open OODB system using Volcano. That paper finds 
the optimization framework useful, but mentions that much time was spent on writing 
support and cost functions and that the interface for these tasks is not user-friendly. We
agree with these statements, and we draw additional conclusions in Sect. 4.

A number of other extensible query optimizers exist. Volcano evolved from the Exo- 
dus optimizer [6], and later was enhanced by the Cascades optimization framework [5], 
which provides a clean interface and implementation that makes full use of C++ classes, 
as well as more closely integrates transformation rules and implementation rules, which 
are distinct sets in Volcano. Since Cascades was intended to be used for Microsoft’s
SQL Server, its code is not available. Neither is the code for the Starburst query
optimizer [8] used in IBM’s DB2, nor is the code of the EROC toolkit for building query
optimizers [12].

The OPT++ [11] extensible optimizer also uses an object-oriented design with C++
classes to simplify the extension tasks. OPT++ offers a number of search strategies,
including “bottom-up” System R-style [14] and the Volcano search strategy, and it can
emulate both Starburst and Volcano.



6 Conclusions 

A number of extensible query optimizers are available that aim to facilitate changes in 
query algebras and additions of new functionality. Our study reports on the enhancement 
of one prominent such extensible query optimizer, Volcano, to support an extended 
relational algebra, which - in addition to new temporal operators - contains six types 
of equivalences between relations that lead to six corresponding types of transformation 
rules. We describe how Volcano’s search-space generation and plan search were modified 
in order to support the algebra, and we evaluate the extensibility of Volcano. 

The study indicates that support for sets, multisets, and lists is difficult to add to a pre- 
existing extensible query optimizer - such support should be considered already during 








G. Slivinskas and C.S. Jensen 



the design of an extensible query optimizer. Volcano’s two-staged optimization strategy
forces the application of all transformation rules, regardless of their type, during the
first stage; if the optimization had occurred in a single stage, we speculate that it would
have been easier to control the applicability of rule types and that better performance
would have resulted. We also found that, for the modifications we considered, Volcano’s
interface was not always user-friendly and that the amount of code needed to implement
support functions was quite substantial. On the other hand, we found Volcano to be a
very useful tool that allowed us to validate our algebra in the middleware architecture
more quickly than if we had had to develop our own optimizer.

This study indicates that extensible query optimizers are useful when testing research
ideas and building prototypes. We also believe that extensible optimizers, if developed
in industrial-strength versions, will prove very useful when building middleware systems
that focus on specific functionality suitable for applying conventional relational
query optimization techniques. The application of extensible technology to middleware
systems is a promising research direction. Due to the increasing use of user-defined
routines in conventional DBMSs, optimizer extensibility is also important when creating
new DBMSs or modifying existing ones. Finally, the study reported here indicates
that more research is needed in query optimization and processing that offers integrated
support for sets, multisets, and lists.

Acknowledgements. We are grateful to Richard Snodgrass, who took part in the research
that led to the query optimization framework, the Volcano-based implementation of
which is reported in this paper.

The research reported here was supported in part by the Wireless Information Man- 
agement network, funded by the Nordic Academy for Advanced Study through grant 
000389, by the Danish Technical Research Council through grant 9700780, and by a 
grant from the Nykredit Corporation. 



References 

1. J. A. Blakeley, W. J. McKenna, and G. Graefe. Experiences Building the Open OODB Query 
Optimizer. In Proceedings of ACM SIGMOD, pp. 287-296 (1993). 

2. R. Bliujute, S. Saltenis, G. Slivinskas, and C. S. Jensen. Developing a DataBlade for a New
Index. In Proceedings of IEEE ICDE, pp. 314-323 (1999).

3. M. J. Carey and D. Kossmann. Processing Top N and Bottom N Queries. Data Engineering
Bulletin, 20(3):12-19 (1997).

4. S. Chaudhuri. An Overview of Query Optimization in Relational Systems. In Proceedings of 
ACM PODS, pp. 34-43 (1998). 

5. G. Graefe. The Cascades Framework for Query Optimization. Data Engineering Bulletin,
18(3):19-29 (1995).

6. G. Graefe and D. J. DeWitt. The Exodus Optimizer Generator. In Proceedings of ACM 
SIGMOD, pp. 160-172 (1987). 

7. G. Graefe and W. J. McKenna. The Volcano Optimizer Generator: Extensibility and Efficient 
Search. In Proceedings of IEEE ICDE, pp. 209-218 (1993). 

8. L. M. Haas et al. Starburst Mid-Flight: As the Dust Clears. IEEE TKDE, 2(1):143-160 (1990).

9. Informix Software. DataBlade Overview. URL: <www.informix.com/products/ 
options/udo/datablade/>, current as of May 29, 2001. 







10. M. Jaedicke and B. Mitschang. User-Defined Table Operators: Enhancing Extensibility for
ORDBMS. In Proceedings of VLDB, pp. 494-505 (1999).

11. N. Kabra and D. J. DeWitt. OPT++: An Object-Oriented Implementation for Extensible
Database Query Optimization. VLDB Journal, 8(1):55-78 (1999).

12. W. J. McKenna, L. Burger, C. Hoang, and M. Truong. EROC: A Toolkit for Building NEATO
Query Optimizers. In Proceedings of VLDB, pp. 111-121 (1996).

13. Oracle Technology Network. Overview of PL/SQL. URL: <otn.oracle.com/tech/ 
pl_sql/>, current as of May 29, 2001. 

14. P. G. Selinger et al. Access Path Selection in a Relational Database Management System. In 
Proceedings of ACM SIGMOD, pp. 23-34 (1979). 

15. G. Slivinskas, C. S. Jensen, and R. T. Snodgrass. A Foundation for Conventional and Temporal
Query Optimization Addressing Duplicates and Ordering. IEEE TKDE, 13(1):21-49 (2001).

16. G. Slivinskas, C. S. Jensen, and R. T. Snodgrass. Adaptable Query Optimization and Evalu- 
ation in Temporal Middleware. In Proceedings of ACM SIGMOD, pp. 127-138 (2001). 




Information Sources Registration at a Subject 
Mediator as Compositional Development 



Dmitry O. Briukhov, Leonid A. Kalinichenko, and Nikolay A. Skvortsov 

Institute for Problems of Informatics RAS 
{brd, leonidk, scvora}@synth.ipi.ac.ru



Abstract. A method for registering heterogeneous information sources at
subject mediators with a local-as-view (LAV) organization is presented.
The LAV approach considers schemas exported by sources as materialized
views over virtual classes of the mediator. This approach is intended to
cope with a dynamic, possibly incomplete set of sources. To disseminate
their information sources, providers should register them at a respective
subject mediator. Such registration can be done concurrently and at
any time.

The proposed registration method is new and makes the following
contributions. The method is applicable to a wide class of source
specification models representable in the hybrid semistructured/object
canonical mediator model. Ontological specifications are used to identify
mediator classes semantically relevant to a source class. The maximal
subset of source information relevant to the mediator classes is identified.
Concretizing types are defined so that the instance types of federated
classes are refined by the source instance type. This direction naturally
supports query planning, which refines a mediator query in terms of a
specific source. Such a direction of refinement is in contrast to conventional
compositional development, where a specification of requirements is to be
refined by specifications of components. Such an inversion is natural for
the registration process: a materialized view (requirements) is constructed
over virtual specifications (components).



1 Introduction 

Mediation of heterogeneous information sources provides an approach for intelligent
information integration. The mediation architecture introduced in [20] defines
the idea of middleware positioned between information sources (information
providers) and information consumers. Mediators support modelling facilities
and methods for converting an unorganized, nonsystematic population of autonomous
information sources kept by different information providers into a
well-structured information collection defined by integrated uniform specifications.
Mediators also provide a uniform query interface to the multiple data
sources, thereby freeing the user from having to locate the relevant sources,
query each one in isolation, and manually combine the information from them.
Important application areas greatly benefit from the subject mediation approach
supporting information integration in a particular subject domain. Among them



A. Caplinskas and J. Eder (Eds.): ADBIS 2001, LNCS 2151, pp. 70-83, 2001.
© Springer-Verlag Berlin Heidelberg 2001







are Web information integration systems, digital libraries providing content
interoperability, and digital repositories of knowledge in certain domains (such as
Digital Earth, Digital Sky, Digital Bio, Digital Law, Digital Art, Digital Music).

For such areas, according to the approach, the application domain model is
to be defined by experts in the field independently of relevant information
sources. This model may include specifications of data structures, terminologies
(thesauri), concepts (ontologies), methods applicable to data, and processes
(workflows) characteristic of the domain. These definitions constitute the
specification of a subject mediator. After the subject mediator has been specified,
information providers can disseminate their information for integration in the
subject domain independently of each other and at any time. To do so, they should
register their information at the subject mediator. Users need not know anything
about the registration process or about the sources that have been registered. Users
need to know only the subject domain definitions, which contain concepts, structures,
and methods approved by the subject domain community. Thus various information
sources belonging to different providers can be registered at a mediator.

The subject mediation approach is applicable to various subject domains
in science, cultural heritage, mass media, e-commerce, etc. This technology is
considered a promising alternative to the widely used general-purpose Web
search engines, which are characterized by very low search precision due to the
uncontrolled use of terms for indexing and search. This is the unavoidable price
of the simplicity of site “registration” at the engines.

Two basic approaches are known for mediator architectures [6]. According
to the first one, called Global as View (GAV), the global schema is constructed
by several layers of views above the schemas exported by pre-selected sources.
Queries are expressed in terms of the global schema and are evaluated similarly
to the conventional federated database approaches [17]. TSIMMIS [7] and
HERMES [18] apply this architecture. Another approach, known as Local as View
(LAV), considers schemas exported by sources as materialized views over virtual
classes of the mediated schema. Queries are expressed in terms of the mediated
schema. Query evaluation is done by query planning, which rewrites the query in
terms of the source schemas. Information Manifold [14] and Infomaster [5] apply
this strategy. The LAV architecture is designed to cope with a dynamic, possibly
incomplete set of sources. Sources may change their exported schemas or become
unavailable from time to time. LAV is potentially scalable with respect to the
number of sources involved. In the rest of this paper we assume LAV as the approach
suitable for subject mediation.

Two separate phases of the subject mediator’s functioning are distinguished:
the consolidation phase and the operational phase. The consolidation phase is
intended for the subject model definition. During this phase the mediator’s
federated schema metainformation is formed. The technique used for that is beyond
the scope of this paper.

During the operational phase the burden of the sources registration process 
is imposed on the information providers. They formulate sources’ specifications 
(schemas, concept definitions, vocabularies) in terms of the subject mediator’s 







federated metainformation and develop the required wrappers. In this registration
process, the local metainformation sublayer of the mediator is formed,
expressing local schemas in the mediator’s canonical model as views above the
federated schema. Concurrent source registration by providers is the
way to achieve the mediator’s scalability.

During registration a local source class is modelled as a set of instances
(objects) of the class instance type, and the description of the source in terms
of the federated schema specifies the constraints that class instances must
satisfy to be admissible for the subject mediator. Formally, the content of a
local source class is described by a formula, simplified as

$$C(z) \subseteq \exists x\, (C_1(x_1) \;\&\; \ldots \;\&\; C_n(x_n) \;\&\; Con)$$

where $C$ is a local source class, $C_1, \ldots, C_n$ are federated schema classes,
$z$ is a reduct [10] of the local class instance type being a concretization of a
reduct of the resulting instance type of the conjunctive formula that includes only
those attributes of the resulting type whose names are not bound by the existential
quantifier, and $Con$ denotes additional constraints imposed by formulae. That is,
the defined reduct of an instance obtained from the local source class should satisfy
the constraint expressed by the formula. Of course, this description does not imply
that the local source contains all the instances that satisfy the formula. Such a
representation of local sources means that we do not have to add a local source
class to the federated level whenever sources are added, since this class does not
have to correspond directly to a federated class.

The general idea of representing local classes in terms of federated classes
is similar to the one proposed in [14]. The main differences of the current approach
consist in taking into account issues more relevant to real environments, such
as using a general type model and type specification calculus [10], applying the
refining mapping of local specific data models into the canonical model of
the mediator [12], resolving ontological differences between federated and local
concepts [2], and systematically resolving structural, behavioral and value
conflicts of local and federated types and classes [2].

This paper focuses on the analysis methods and tools required to support the
information source registration process at the mediator. The main contribution of
the paper is that, for the LAV mediation strategy, it considers information
source registration as a process of compositional information systems development
[2]. Local source metainformation definitions are treated as specifications of
requirements, and classes of the federated level with the related metainformation
are treated as specifications of pre-existing components. To obtain local class
definitions as views above the federated level with constraints given in the form
shown above, the facilities of the modified compositional information systems
development method and tool [2] are applied. This modified method and tool are
considered to be a part of the mediator’s metainformation support. The method
considered can also be useful for a Data Warehouse (DW) environment. The main
difference between the mediator and DW approaches is that a mediator schema
is virtual, whereas a DW schema is materialized.

At the time of writing, no works considering source registration
at a LAV-approached mediator as a design process have been identified. Related







papers in DW design could have also been expected. Analysis of the latter is given
in the Related Work section. The analysis shows that DW design works have different
motivations and apply different methods and models compared to those that are
needed for source registration at LAV-approached mediators.

The paper is organized as follows. After a brief characterization of the canonical
model and metainformation support at a mediator, the registration process
is treated in detail. The process of compositional development leading to the
definition of source classes in terms of classes of the federated level is introduced.
Contextualization of a source at the federated level is a part of this process.
Various aspects of the registration process are illustrated by an example. Cultural
heritage is taken as the application domain, assuming that a respective mediator
has been defined. For the sources we use slightly modified schemas of the Web
sites of the Louvre and Uffizi museums and the Z39.50 CIMI profile. Due to the paper
size limitation, only some classes of the Uffizi site are included.



2 Related Works 

In [4] the DW Schema is considered to be a description of logical content of 
the materialized views constituting the DW. Each portion of such schema is 
described in terms of a set of definitions of relations, each one expressed in terms 
of a query over the DW Domain Model. A view is actually materialized starting 
from the data in the sources by means of software components, called mediators. 
A source is defined both on a logical level (relations) and on a conceptual level
(E/R). A query is provided to express a logical source as a view above conceptual
definitions. Integration is seen as the incremental process of understanding and
representing the relationships between data in the sources, rather than simply 
producing a unified data schema. Description logic which treats n-ary relations 
as first-class citizens is used for the conceptual modeling of both the subject 
domain and the various sources. The approach [4] resembles a consolidation 
phase of a GAV mediator. 

A common understanding of a “well-designed” DW schema in the literature 
[9] is that such schema should have the form of a “star”, i.e., it should con- 
sist of a central fact table that contains the facts of interest to an application, 
and that is connected to a number of dimension tables through referential in- 
tegrity constraints based on the various dimension keys. Since dimensions can be 
composed of attribute hierarchies, it is often the case that dimension tables are 
unnormalized, and their normalization results in what is known as a snowflake 
schema. 

Known research works in DW design (including [4,8,9,19]) have different
motivations and apply different methods and models compared to those that
are needed for source registration at LAV-approached mediators or at a DW
with a predefined conceptual schema.







3 Fundamentals of the Compositional IS Development 
Method 

The main distinguishing feature of the method considered is the creation of
compositions of component specification fragments refining specifications of
requirements. Widely used component-based development methods (e.g., JavaBeans)
construct aggregates of components differently: they just link ports of components
with each other or consider their interactions on a contractual basis [15].

Refining specifications obtained during compositional development can, according
to the refinement theory, be used anywhere instead of the refined
specifications of requirements without users noticing such substitutions.
The refinement methods [1] make it possible to justify a refinement formally,
guaranteeing the adequacy of the specifications obtained with respect to those required.



3.1 Compositional Specification Calculus 

The design is a process of systematic manipulation and transformation of speci- 
fications. Type specifications of the canonical model (SYNTHESIS) [13] are cho- 
sen as the basic units for such manipulation. The manipulations required include 
decomposition of type specifications into consistent fragments, identification of 
reusable fragments (patterns of reuse), composition of identified fragments into 
specifications concretizing the requirements, justification of reusability and sub- 
stitutability of the results of such transformations instead of the specifications 
of requirements. The compositional specification calculus [10] intentionally de- 
signed for such manipulations uses the following concepts and operations. 

A signature $S_T$ of a type specification $T = \langle V_T, O_T, I_T \rangle$ includes a set of
operation symbols $O_T$ indicating operation argument and result types and a set
of predicate symbols $I_T$ (for the type invariants) indicating predicate argument
types. The conjunction of all invariants in $I_T$ constitutes the type invariant. We
model an extension $V_T$ of each type $T$ (a carrier of the type) by a set of proxies
representing respective instances of the type.

Definition 1. Type reduct. A signature reduct $R_T$ of a type $T$ is defined as a
subsignature $S'_T$ of the type signature $S_T$ that includes a carrier $V_T$, a set of
symbols of operations $O'_T \subseteq O_T$, and a set of symbols of invariants $I'_T \subseteq I_T$.

This definition can easily be extended from the signature level to the specification
level so that a type reduct $R_T$ can be considered a subspecification (with
a signature $S'_T$) of the specification of the type $T$. The specification of $R_T$
should be formed so that $R_T$ becomes a supertype of $T$. We assume that only the
states admissible for a type remain admissible for a reduct of the type (no other
reduct states are admissible). Therefore, the carrier of a reduct is assumed to be
equal to the carrier of its type.
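To make Definition 1 concrete, the following Python sketch models a type specification by its operation and invariant symbols only and builds a reduct as a subsignature that shares the carrier of its type. It is an illustration under stated assumptions, not the SYNTHESIS representation: the class `TypeSpec`, the function `reduct`, and the attribute `dimensions` are hypothetical names introduced here, and attributes are treated as operation symbols for simplicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TypeSpec:
    """Hypothetical model of a type signature: symbols only, no semantics."""
    name: str
    carrier: str            # name of the type whose carrier V_T is used
    operations: frozenset   # operation symbols O_T
    invariants: frozenset   # invariant symbols I_T


def reduct(t: TypeSpec, name: str, operations, invariants=()) -> TypeSpec:
    """Build a reduct of t: a subsignature that shares the carrier of t."""
    ops, invs = frozenset(operations), frozenset(invariants)
    if not ops <= t.operations:
        raise ValueError(f"not operations of {t.name}: {sorted(ops - t.operations)}")
    if not invs <= t.invariants:
        raise ValueError(f"not invariants of {t.name}: {sorted(invs - t.invariants)}")
    return TypeSpec(name, t.carrier, ops, invs)


# Attribute names follow the Painting example; "dimensions" is made up.
painting = TypeSpec(
    name="Painting",
    carrier="Painting",
    operations=frozenset({"title", "created_by", "date_of_origin", "dimensions"}),
    invariants=frozenset(),
)
r = reduct(painting, "R_Painting", {"title", "created_by", "date_of_origin"})
```

Rejecting symbols that do not belong to the original type is what keeps the result a subsignature; sharing the carrier name mirrors the assumption that a reduct has the same carrier as its type.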

Definition 2. Type U is a refinement of type T iff 







- there exists a one-to-one correspondence $Ops : O_T \leftrightarrow O_U$;

- there exists an abstraction function $Abs : V_T \rightarrow V_U$ that maps each admissible
state of $T$ into the respective state of $U$;

- $\forall x \in V_T\ \exists y \in V_U\ (Abs(x, y) \Rightarrow I_T \wedge I_U)$;

- for every operation $o \in O_T$ the operation $Ops(o) = o' \in O_U$ is a refinement
of $o$. To establish an operation refinement it is required that the operation
precondition $pre(o)$ should imply the precondition $pre(o')$ and the operation
postcondition $post(o')$ should imply the postcondition $post(o)$.
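The operation-refinement condition in the last item can be illustrated with a small Python sketch. It checks the two implications by exhaustive enumeration over a finite set of inputs and outputs, which is an assumption made purely for illustration (refinement is normally justified by formal proof); the function names and the two example operations are hypothetical.

```python
from itertools import product


def implies(p, q, domain) -> bool:
    """True if p(s) implies q(s) for every s in the finite domain."""
    return all((not p(s)) or q(s) for s in domain)


def refines_operation(pre_o, post_o, pre_r, post_r, inputs, outputs) -> bool:
    """Check that operation o' (pre_r, post_r) refines operation o (pre_o, post_o)."""
    pairs = list(product(inputs, outputs))
    # pre(o) must imply pre(o'): the refinement accepts at least the same inputs.
    weaker_pre = implies(pre_o, pre_r, inputs)
    # post(o') must imply post(o): the refinement promises at least as much.
    stronger_post = implies(lambda io: post_r(*io), lambda io: post_o(*io), pairs)
    return weaker_pre and stronger_post


# Hypothetical operations over small integers.
# o : requires x > 0,  guarantees result > x
# o': requires x >= 0, guarantees result == x + 1
inputs = range(-3, 4)
outputs = range(-3, 6)
pre_o, post_o = (lambda x: x > 0), (lambda x, r: r > x)
pre_r, post_r = (lambda x: x >= 0), (lambda x, r: r == x + 1)

o2_refines_o = refines_operation(pre_o, post_o, pre_r, post_r, inputs, outputs)
o_refines_o2 = refines_operation(pre_r, post_r, pre_o, post_o, inputs, outputs)
```

Here o' refines o because it accepts more inputs and constrains the result more tightly, while the converse check fails at x = 0.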

Based on the notions of reduct and type refinement, a measure of common
information between types in $\mathcal{T}$ can be established.

Definition 3. A common reduct for types $T_1, T_2$ is such a reduct $R_{T_1}$ of $T_1$ that
there exists a reduct $R_{T_2}$ of $T_2$ such that $R_{T_2}$ is a refinement of $R_{T_1}$. Further we
refer to $R_{T_2}$ as a conjugate of the common reduct.

Definition 4. A most common reduct $R_{MC}(T_1, T_2)$ for types $T_1, T_2$ is a reduct
$R_{T_1}$ of $T_1$ such that there exists a reduct $R_{T_2}$ of $T_2$ that refines $R_{T_1}$ and there
can be no other reduct $R^i_{T_1}$ such that $R_{MC}(T_1, T_2)$ is a reduct of $R^i_{T_1}$, $R^i_{T_1}$ is not
equal to $R_{MC}(T_1, T_2)$, and there exists a reduct $R^i_{T_2}$ of $T_2$ that refines $R^i_{T_1}$.

Reducts provide for decompositions of type specifications, thus creating a basis
for their further compositions. Type composition operations can be used to infer
new types from the existing ones. We introduce here the definition of only one
operation, join. Let $T_i\ (1 \le i \le n) \in \mathcal{T}$ denote types.

Definition 5. Type join operation. An operation $T_1 \sqcup T_2$ produces a type $T$
as a 'join' of the specifications of the operand types. Generally $T$ includes a merge
of the specifications of $T_1$ and $T_2$. Common elements of the specifications of $T_1$
and $T_2$ are included into the merge (resulting type) only once. The common elements
are determined by another merge: the merge of conjugates of the two most common
reducts of types $T_1$ and $T_2$, $R_{MC}(T_1, T_2)$ and $R_{MC}(T_2, T_1)$. The merge of two
conjugates includes the union of the sets of their operation specifications. If in the
union we get a pair of operations that are in a refinement order, then only one of them,
the more refined one (belonging to the conjugate of the most common reduct), is
included into the merge. Invariants created in the resulting type are formed by
conjoining invariants taken from the original types.

A type T is placed in the type hierarchy as an immediate subtype of the join 
operand types and a direct supertype of all the common direct subtypes of the 
join argument types. 

Operations of the compositional calculus form a type lattice [10] on the basis 
of a subtype relation (as a partial order). 
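A rough Python sketch of the join operation may help to fix the idea. It rests on a strong simplification that is not part of the definition above: common elements are matched by operation name rather than derived from the conjugates of the most common reducts, and the refinement order between operation specifications is supplied by the caller. The dictionary layout, the function `join`, and the toy `refines` order are all hypothetical.

```python
def join(t1: dict, t2: dict, refines) -> dict:
    """Merge two type specifications given as
    {"operations": {name: spec}, "invariants": set_of_invariants}."""
    operations = dict(t1["operations"])
    for name, spec in t2["operations"].items():
        if name not in operations:
            operations[name] = spec                # element of T2 only
        elif refines(spec, operations[name]):
            operations[name] = spec                # T2's version is more refined
        elif not refines(operations[name], spec):
            # Assumption of this sketch: same-named operations must be comparable.
            raise ValueError(f"operation {name!r} is not in a refinement order")
    # The invariant of the join is the conjunction of the operands' invariants.
    invariants = set(t1["invariants"]) | set(t2["invariants"])
    return {"operations": operations, "invariants": invariants}


# Toy refinement order: a spec is the set of properties it guarantees,
# and a spec refines another if it guarantees at least the same properties.
def refines(a, b):
    return a >= b


t1 = {"operations": {"title": frozenset({"string"}),
                     "date": frozenset({"date"})},
      "invariants": {"date <= today"}}
t2 = {"operations": {"title": frozenset({"string", "non-empty"}),
                     "painter": frozenset({"Creator"})},
      "invariants": {"painter is known"}}
t = join(t1, t2, refines)
```

The common operation `title` appears once in the result, in its more refined form, and the invariants of both operands are kept.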

3.2 Design Phase 

Design is the component-based process of concretization of a specification
obtained in the analysis phase by an interoperable composition of pre-existing
information components. It includes reconciliation of application domain (here,







federated schema) and of information source ontological contexts establishing 
the ontological relevance of constituents of requirements and components speci- 
fications, identification of component types (classes) and their fragments (identi- 
fying most common reducts) suitable for the concretization of an analysis model 
type (class) capturing its structural, extensional and behavioural properties, 
composition (using type operations join and meet ) of such fragments into speci- 
fications concretizing the requirements, justification of a property of refinement 
of requirements by such compositions. 



4 Registration of Information Sources at a Subject 
Mediator 

4.1 General Schema of a Process of an Information Source 
Registration at the Mediator 

As a preliminary step, provide an ontological integration of an information source 
specification with the federated level specification. The process of the ontological 
integration is the same as for the compositional development method. 

After the ontological integration, for each source class the following steps are 
required: 

1. relevant federated classes identification

Find federated classes that ontologically can be used for defining the source
class extent in terms of federated classes. Several federated classes may
correspond to a source class, covering with their instance types different
reducts of the instance type of the source class. On the other hand, several
source classes may correspond to one federated class.

2. most common reducts construction 

For an instance type of each identified federated class do: 

a) Construct most common reducts for instance type of this federated class 
and source class instance type to concretize (partially) such federated in- 
stance type. Most common reduct may include also additional attributes 
corresponding to those federated type attributes that can be derived from 
the source type instances to support them. 

b) In this process for each attribute type of the common reduct a con- 
cretizing type, concretizing function or their combination should be con- 
structed (this step should be recursively applied). 

3. partial source view construction

a) For each relevant federated class construct a partial source view expressing
constraints, in terms of the federated class, that should be satisfied
by the values of the respective most common reducts of source class instances.
Thus partial views over all relevant federated classes will be obtained.

4. partial views composition 

a) Construct compositions of the source type most common reducts ob- 
tained for instance types of all federated classes involved. 








Fig. 1. Specifications of types of the federated schema



b) Construct a source view as a composition of partial views obtained above. 
This is an expression of a materialized view of an information source in 
terms of federated classes. An instance type of this view is determined 
by the most common reducts composition constructed above. 
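The four steps above can be sketched in Python as a small pipeline. This is only an outline under heavy assumptions: ontological relevance is given as precomputed similarity values, the most common reduct is approximated by the federated attributes that have a known counterpart among the source attributes, and the resulting view is rendered as a string in which `<=` stands for the containment sign. The function name, the data layout, the class `Sculpture`, its similarity value, and the threshold 0.2 are hypothetical.

```python
def register_source_class(source_class, source_attrs, federated,
                          relevance, correspondences, threshold):
    """Sketch of steps 1-4 for one source class.

    federated:        {federated class name: set of attribute names}
    relevance:        {(federated class, source class): similarity in [0, 1]}
    correspondences:  {federated class: {federated attr: source attr}}
    """
    # Step 1: federated classes ontologically relevant to the source class.
    relevant = [c for c in federated
                if relevance.get((c, source_class), 0.0) > threshold]

    partial_views = []      # step 3 results
    instance_type = set()   # step 4a: composition of the source-side reducts
    for fed_class in relevant:
        # Step 2: approximate the most common reduct by the federated
        # attributes that have a counterpart among the source attributes.
        mapping = {f: s
                   for f, s in correspondences.get(fed_class, {}).items()
                   if f in federated[fed_class] and s in source_attrs}
        if not mapping:
            continue
        # Step 3: a partial view is a constraint over one federated class.
        partial_views.append((fed_class, mapping))
        instance_type |= set(mapping.values())

    # Step 4b: compose partial views into one conjunctive source description.
    body = " & ".join(f"{c}(x{i})"
                      for i, (c, _) in enumerate(partial_views, start=1))
    view = f"{source_class}(z) <= exists x ({body})"
    return {"view": view,
            "instance_type": sorted(instance_type),
            "partial_views": partial_views}


federated = {"Painting": {"title", "created_by", "date_of_origin"},
             "Sculpture": {"title"}}             # Sculpture is hypothetical
relevance = {("Painting", "Canvas"): 0.2491,
             ("Sculpture", "Canvas"): 0.05}      # second value is made up
correspondences = {"Painting": {"title": "title",
                                "created_by": "painter",
                                "date_of_origin": "date"}}
result = register_source_class(
    "Canvas", {"title", "painter", "date", "description"},
    federated, relevance, correspondences, threshold=0.2)
```

The sketch deliberately leaves out the recursive construction of concretizing types and functions (step 2b) and the additional constraints of the view.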

4.2 Example of Subject Mediator and Information Source 
Specifications 

In the rest of the paper we demonstrate our approach on an example showing
the registration of the schema of the Uffizi museum Web site at a subject
mediator. Cultural heritage is taken as the application domain of the mediator.
For the federated and local source type schemas we use UML diagrams and, to
save space, do not provide canonical model specifications.

Figure 1 shows a part of the federated schema of the mediator for the cultural 
heritage domain. 

Figure 2 shows a part of the schema of the Uffizi museum Web site. We use
a description similar to the one used in the Araneus project.

Text and Textual types define interfaces of different types used for textual
search (in Z39.50 and in Oracle respectively). Establishing refinement
relationships for these types is beyond the scope of this paper.

4.3 Ontological Integration Example 

Ontological integration is used to establish the ontological relevance of
constituents of the mediator’s federated schema and local source schema
specifications. Specifications of the federated schema and specifications of local
sources must be associated with ontological contexts containing concepts of the
respective subject areas. Each element of the specifications must be associated
with an ontological concept describing the element. Elements (types, classes,
attributes and others) are defined as instances of semantically relevant
ontological classes (concepts).

Fig. 2. Specifications of types of the Uffizi site schema

The ontological concepts have their verbal definitions and descriptor lists. 
Verbal concept definitions are similar to definitions of words in an explanatory 
dictionary. Descriptor lists included in the specifications of concepts are built 
on the basis of the meaningful words lists in the concept’s verbal definition. 
Concept descriptors are required for establishing relationships between concepts. 
Hypernym/hyponym relationships and positive relationships (synonym) can be 
defined between ontological concepts. These relationships can be treated as fuzzy 
ones. 

Relationships between concepts of different contexts are established by cal- 
culating the correlation coefficients between concepts on the basis of their verbal 
definitions. The correlation coefficients are calculated using the vector-space ap- 
proach [16,3]: 



$$sim(X, Y) = \frac{\sum_{k=1}^{t} (w_{Xk} \cdot w_{Yk})}{\sqrt{\sum_{k=1}^{t} (w_{Xk})^2 \cdot \sum_{k=1}^{t} (w_{Yk})^2}}$$



The range of values of the similarity function $sim(X, Y)$ is the real interval
[0.0, 1.0]. The concept $X$ is considered positively related to the concept $Y$ if
$sim(X, Y)$ is greater than a certain threshold value $\varepsilon$; in this case, an
intercontext positive relationship is established between $X$ and $Y$, and the value
of the function is considered the relationship strength.
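The vector-space similarity and the threshold test can be illustrated with a short Python sketch. The text does not say how term weights are derived from verbal definitions, so raw term frequency is used here purely as an assumption; the function names and the threshold passed to `positively_related` are hypothetical, and the sketch is not expected to reproduce the coefficients reported in the paper.

```python
import math
from collections import Counter


def term_weights(definition: str) -> Counter:
    """Assumed weighting: raw frequency of lower-cased words in a definition."""
    words = [w.strip('.,;:"()').lower() for w in definition.split()]
    return Counter(w for w in words if w)


def sim(x: Counter, y: Counter) -> float:
    """Cosine similarity between two term-weight vectors."""
    dot = sum(x[t] * y[t] for t in x.keys() & y.keys())
    norm = math.sqrt(sum(v * v for v in x.values()) *
                     sum(v * v for v in y.values()))
    return dot / norm if norm else 0.0


def positively_related(x: Counter, y: Counter, threshold: float) -> bool:
    """A positive relationship holds if the similarity exceeds the threshold."""
    return sim(x, y) > threshold


art_work = term_weights(
    "Art such as sculpture, drawings or paintings depicting something")
iconographic = term_weights(
    "objects such as paintings, drawings, watercolours and other similar objects")
strength = sim(iconographic, art_work)
```

With non-negative weights the result always lies in [0.0, 1.0], matching the stated range; terms that occur in only one of the definitions contribute to the norms but not to the numerator.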

Let’s consider the federated concept IconographicObject, which is associated
with the element Painting of the federated schema, and the local concept ArtWork,
which is associated with the element Canvas of the local source schema.
Specifications of the concepts look as follows:



{IconographicObject;
in: concept;
supertype: ManMadeObject, ConceptualObject;












def : "This entity comprises objects which are designed primarily or in 
addition to another functionality to represent or depict something in 
an optical manner, be it concrete or abstract. This entity has a 
certain pragmatic value in the fine arts since it conveniently groups 
together objects such as paintings, drawings, watercolours and other 
similar objects." 

} 

{ArtWork;
in: concept;
def: "Art such as sculpture, drawings or paintings depicting something"
}



The correlation coefficient of the similarity function is the following: 
sim(IconographicObject, ArtWork) = 0.2491 

Based on this coefficient we establish a positive relationship between these
concepts. The correlation between the schema elements Painting and Canvas is
established with the same strength. Hypernym/hyponym concept relationships
are established analogously [3].



4.4 Most Common Reducts Construction 

Construct most common reducts for the federated types Painting and Creator
and their ontologically relevant source type Canvas (Figs. 1, 2). During construction
of the most common reduct of types defined in the specification of requirements
(local source definition) and in a component (federated schema classes), various
conflicts between the specifications should be discovered and resolved. The
result of the conflict resolution is represented as a transformation of a conjugate
reduct into a concretizing reduct. Together with the attributes of the reduced
type, the concretizing reduct specification includes the mapping of the
concretizing reduct attributes into attributes of the common reduct and the
conflict resolution functions.

The most common reducts and their concretizing reducts are specified as
SYNTHESIS types. Specific metaslots and attributes are used to represent
specific semantics. The specification of the common reduct for Painting and
Canvas, denoted R_Painting_Canvas, is defined as follows:

{R_Painting_Canvas;
  in: reduct;
  metaslot
    of: Painting;
    taking: {title, created_by, date_of_origin, narrative, digital_form,
             in_collection};
    c_reduct: CR_Painting_Canvas;
  end;
}







The slot of refers to the reduced federated type. The slot taking lists the
names of those attributes of the reduced type that are to be included in the
common reduct.

The slot c_reduct refers to the concretizing reduct based on a source type. We
add a concretizing reduct for the Painting, Canvas pair of types as the
following type definition:



{CR_Painting_Canvas;
  in: c_reduct;
  metaslot
    of: Canvas;
    taking: {title, painter, date, description, to_image};
    reduct: R_Painting_Canvas
  end;
  simulating: {
    R_Painting_Canvas.title ~ CR_Painting_Canvas.title;
    R_Painting_Canvas.created_by ~ CR_Painting_Canvas.get_created_by;
    R_Painting_Canvas.date_of_origin ~ CR_Painting_Canvas.date;
    R_Painting_Canvas.narrative ~ CR_Painting_Canvas.description;
    R_Painting_Canvas.digital_form ~ CR_Painting_Canvas.to_image;
    R_Painting_Canvas.in_collection ~ CR_Painting_Canvas.get_in_collection
  };
  get_created_by: {in: function;
    params: {+ext/CR_Painting_Canvas, -returns/Creator};
    predicative: {ex c/Canvas ((c/CR_Painting_Canvas = ext) &
      ex a/Artist ((c.painter = a.name) & returns = a/CR_Creator_Artist))}
  };
  get_in_collection: {in: function;
    params: {+ext/CR_Painting_Canvas, -returns/Collection};
    predicative: {ex c/Canvas ((c/CR_Painting_Canvas = ext) &
      ex r/Room ((in(c, r.paint_list) & returns = r/CR_Room_Collection)))}
  }
}



The slot of refers to the source type. In this case, the slot reduct refers to
the respective common reduct. The predicate simulating shows how the
concretizing state is mapped into the common reduct state.

E.g., R_Painting_Canvas.date_of_origin ~ CR_Painting_Canvas.date defines that
the attribute date_of_origin of the reduct R_Painting_Canvas is refined by the
attribute date of the concretizing reduct CR_Painting_Canvas and that values of
the attribute date_of_origin are taken from values of the attribute date. The
function get_created_by represents the mediating function resolving the
conflict in mixed pre- and post-conditions.
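The attribute mapping expressed by the simulating predicate can be illustrated with a small Python sketch. Only the attribute names are taken from the specifications; the record layout, the sample values, and the choice to identify a canvas by its title are assumptions made for illustration, not part of SYNTHESIS.

```python
# Hypothetical source data; only the attribute names come from the specification.
ARTISTS = [{"name": "Example Artist"}]
ROOMS = [{"name": "Example Room", "paint_list": ["Example Title"]}]

def get_created_by(canvas):
    """Resolve the painter name of a canvas to the matching artist object."""
    return next((a for a in ARTISTS if a["name"] == canvas["painter"]), None)

def get_in_collection(canvas):
    """Find the room whose paint_list contains the canvas (identified by title here)."""
    return next((r for r in ROOMS if canvas["title"] in r["paint_list"]), None)

# Common-reduct attribute -> how its value is obtained from a canvas.
SIMULATING = {
    "title": lambda c: c["title"],
    "created_by": get_created_by,
    "date_of_origin": lambda c: c["date"],
    "narrative": lambda c: c["description"],
    "digital_form": lambda c: c["to_image"],
    "in_collection": get_in_collection,
}

def to_common_reduct(canvas):
    """Map one canvas record into the attributes of R_Painting_Canvas."""
    return {attr: getter(canvas) for attr, getter in SIMULATING.items()}

canvas = {"title": "Example Title", "painter": "Example Artist", "date": "1500",
          "description": "A sample description", "to_image": "example.jpg"}
painting = to_common_reduct(canvas)
```

Plain attributes are copied under their federated names, while created_by and in_collection go through the two mediating functions because their values are objects rather than strings in the federated schema.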

The most common reduct and concretizing reduct for types Creator and 
Canvas are constructed similarly. 





4.5 Source View Construction 

Now, for each relevant federated class, we construct a formula expressing a
constraint in terms of the federated classes that should be satisfied by local
class instances.

The formula expressing the local class canvas in terms of the federated class
painting is defined as:

canvas(p/CR_Painting_Canvas) ⊆ painting(p/R_Painting_Canvas) &
p.in_collection.in_repository = 'Uffizi'

The specification of a class containing this formula (in effect, a
local-as-view class) is:

{v_canvas_painting;
  in: class;
  class_section: {
    key: invariant, {unique; {title}};
    lav: invariant, {subseteq(v_canvas_painting(p),
      painting(p/R_Painting_Canvas) &
      p.in_collection.in_repository = 'Uffizi')}
  };
  instance_section: CR_Painting_Canvas
}
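The subset constraint stated by the lav invariant can be read operationally, as in the following Python sketch. It assumes dictionary-shaped instances keyed by title (following the key invariant); the function name and the sample records are hypothetical.

```python
def satisfies_lav(view_instances, federated_paintings, repository="Uffizi"):
    """True if every view instance is a federated painting held in the repository."""
    allowed = {
        p["title"]
        for p in federated_paintings
        if p["in_collection"]["in_repository"] == repository
    }
    return all(v["title"] in allowed for v in view_instances)

# Hypothetical federated class extent.
paintings = [
    {"title": "A", "in_collection": {"in_repository": "Uffizi"}},
    {"title": "B", "in_collection": {"in_repository": "Elsewhere"}},
]
```

The check only states containment: the source may hold fewer paintings than the federated class, which is what allows the set of registered sources to stay incomplete.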



Now, another formula expressing the local class canvas in terms of the
federated class creator is defined as:

canvas(c/CR_Creator_Canvas) ⊆ creator(c/R_Creator_Canvas) &
∃w/Painting(in(w, c.works) & w.in_collection.in_repository = 'Uffizi')

The specification of the respective class v_canvas_creator, denoting the local
source as a view, is similar to the specification of the class
v_canvas_painting.

Now we apply type compositions and identify a useful part of the Canvas type
as a composition of the type reducts obtained before. We consider a
concretizing reduct CR_Painting_Creator_Canvas as the join of the concretizing
reducts CR_Painting_Canvas and CR_Creator_Canvas:

CR_Painting_Creator_Canvas = CR_Painting_Canvas ⊔ CR_Creator_Canvas

A final formula for the local class canvas in terms of the federated classes
painting and creator is created as a conjunction of the partial constraints:

canvas(p/CR_Painting_Creator_Canvas) ⊆
painting(p/R_Painting_Canvas) & p.in_collection.in_repository = 'Uffizi' &
creator(c/R_Creator_Canvas) & ∃w/Painting(in(w, c.works) &
w.in_collection.in_repository = 'Uffizi')



The complete definition of the source view is as follows:







{v_canvas;
  in: class;
  class_section: {
    key: invariant, {unique; {title}};
    lav: invariant, {subseteq(v_canvas,
      painting(p/R_Painting_Canvas) &
      p.in_collection.in_repository = 'Uffizi' &
      creator(c/R_Creator_Canvas) & ex w/Painting(in(w, c.works) &
      w.in_collection.in_repository = 'Uffizi'))}
  };
  instance_section: CR_Painting_Creator_Canvas;
}



The last composition may be considered as a join composition of types defined 
in class sections of partial views (that is, as a composition of types of classes 
treated as objects). 



5 Conclusion 

The paper presents a method for registering heterogeneous information sources
at subject mediators. Source specifications as materialized views above virtual
classes of the mediator are designed by applying the compositional development
method (the source definition is treated as a specification of requirements,
and class definitions of the federated schema are treated as component
specifications). This approach is intended to cope with a dynamic, possibly
incomplete set of sources. Sources may change their exported schemas or become
unavailable from time to time. To disseminate the information sources, their
providers should register them at a respective subject mediator. Such
registration can be done concurrently and at any time. To make subject
mediators scalable with respect to the number of sources involved, specific
methods and tools supporting the process of information source registration
are required. The method is applicable to a wide class of source specification
models representable in the hybrid semistructured/object canonical mediator
model [11].

The registration tool is being developed by reusing the SYNTHESIS compositional
design method prototype [2]. The tool is based on Oracle 8i and Java 2 under
the Windows environment.



References 



1. R.-J. Back, J. von Wright. Refinement Calculus: A systematic Introduction. 
Springer Verlag, 1998 

2. D. O. Briukhov, L. A. Kalinichenko. Component-Based Information Systems
Development Tool Supporting the SYNTHESIS Design Method. In Proc. of the East
European Symposium on "Advances in Databases and Information Systems", Poland,
Springer, LNCS No. 1475, 1998







3. D. O. Briukhov, S. S. Shumilov. Ontology Specification and Integration Facilities 
in a Semantic Interoperation Framework, In Proc. of the International Workshop 
ADBIS’95, Springer, 1995 

4. D. Calvanese, G. De Giacomo, M. Lenzerini, D. Nardi, R. Rosati. Source Inte- 
gration in Data Warehousing, In Proc. of the 9th Int. Workshop on Database and 
Expert Systems Applications (DEXA-98), pages 192-197. IEEE Computer Society 
Press, 1998 

5. O. Duschka and M. Genesereth. Answering Queries Using Recursive Views. In 
Principles Of Database Systems (PODS), 1997 

6. M. Friedman, A. Levy, and T. Millstein. Navigational Plans for Data Integration, 
1999 

7. H. Garcia-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J. Ullman, and 
J. Widom. The Tsimmis Approach to Mediation: Data Models and Languages. 
Journal of Intelligent Information System, 1997 

8. M. Golfarelli, D. Maio, S. Rizzi. Conceptual Design of Data Warehouses from E/R 
Schemes In Proc. of the 31st Hawaii International Conference on System Sciences, 
Kona, Hawaii, 1998 

9. B. Husemann, J. Lechtenborger, G. Vossen. Conceptual Data Warehouse Design. 
In Proc. of the International Workshop on Design and Management of Data Ware- 
houses (DMDW’2000), Stockholm, Sweden, June 5-6, 2000 

10. L. A. Kalinichenko. Compositional Specification Calculus for Information Systems 
Development. In Proc. of the East-West Symposium on Advances in Databases and 
Information Systems (ADBIS’99), Maribor, Slovenia, September 1999, Springer 
Verlag, LNCS, 1999 

11. L. A. Kalinichenko. Integration of heterogeneous semistructured data models
in the canonical one. In Proc. of First Russian National Conference on "Digital
Libraries: Advanced Methods and Technologies, Digital Collections",
Saint-Petersburg, October 1999

12. L. A. Kalinichenko. Method for data models integration in the common
paradigm. In Proc. of the First East European Workshop "Advances in Databases
and Information Systems", St. Petersburg, September 1997

13. L. A. Kalinichenko. SYNTHESIS: the language for description, design and
programming of the heterogeneous interoperable information resource
environment. Institute for Problems of Informatics, Russian Academy of
Sciences, Moscow, 1995

14. A. Levy, A. Rajaraman, and J. Ordille. Querying Heterogeneous Information 
Sources using Source Descriptions. In Proc. of the 22nd Conf. on Very Large 
Databases, pages 251-262, 1996 

15. M. Lumpe. A Pi-Calculus Based Approach to Software Composition, Ph.D. the- 
sis, University of Bern, Institute of Computer Science and Applied Mathematics, 
January 1999 

16. G. Salton, C. Buckley. Term-Weighting Approaches in Automatic Text Retrieval.
Readings in Information Retrieval, K. S. Jones and P. Willett, Kaufmann, 1997

17. A. Sheth and J. Larson. Federated Database Systems for Managing Distributed, 
Heterogeneous, and Autonomous Database. ACM Computing Surveys, 1990 

18. V. S. Subrahmanian. Hermes: a Heterogeneous Reasoning and Mediator System,
http://www.cs.umd.edu//projects/hermes/publications/postscripts/tois.ps

19. N. Tryfona, F. Busborg, J. G. Borch Christiansen. starER: A Conceptual Model 
for Data Warehouse Design, ACM Second International Workshop on Data Ware- 
housing and OLAP (DOLAP), 1999 

20. G. Wiederhold. Mediators in the Architecture of Future Information Systems. 
IEEE Computer, 1992 




Extracting Theme Melodies by Using a Graphical 
Clustering Algorithm for Content-Based Music 
Information Retrieval 



Yong-Kyoon Kang, Kyong-I Ku, and Yoo-Sung Kim 



Department of Computer Science & Engineering 
INHA University, INCHEON 402-751, Korea 
yskim@inha.ac.kr



Abstract. We propose a mechanism for extracting theme melodies from a song by
using a graphical clustering algorithm. In the proposed mechanism, a song is
split into a set of motifs, each of which is the minimum meaningful unit. Then
the system clusters the motifs into groups based on the similarity values
calculated between all pairs of motifs, so that the motifs within a cluster are
more similar to each other than to motifs in other clusters. From each cluster,
the system selects a theme melody based on the positions of the motifs within
the song and the maximum summation of the similarity values of the edges
adjacent to the motif node in each cluster. As an experimental result, we show
an example that describes how the theme melodies of a song can be extracted by
using the proposed algorithm.



1 Introduction 



As users' requests for systematic management of multimedia information have
increased continuously, content-based multimedia information retrieval has
become necessary to satisfy users' requirements. For music information, the
traditional music retrieval mechanisms that use the metadata of music have a
major restriction: users must remember some appropriate metadata of a song to
formulate the retrieval query. If users do not know the metadata of the song
they want, they may not be able to retrieve the song. This problem is caused by
the general fact that people tend to remember a part of a song rather than its
metadata. To overcome this restriction, an effective content-based music
retrieval mechanism is needed. In a content-based music retrieval system, users
input some part of a song that they remember, and the system then retrieves
songs that contain some variations of the given melodies.

Several content-based music retrieval systems have been developed ([1,2]).
However, the previous systems have two major problems. First, these systems
do not have an indexing mechanism that helps to improve the retrieval
performance.



A. Caplinskas and J. Eder (Eds.): ADBIS 2001, LNCS 2151, pp. 84-97, 2001. 
© Springer-Verlag Berlin Heidelberg 2001 







Second, these systems do not use the full set of primitive features that are
essential to represent the semantics of music.

To solve the first problem, previous studies ([3,4,5]) have proposed theme
melody index schemes in which the theme melodies extracted from a song are
included. However, in [3], [4], and [5], theme melodies are represented by
using only the pitches of the notes within a melody. Furthermore, they consider
only the exactly repeated patterns within a music object as the theme melodies
of the music. That is, the previous extraction mechanisms cannot extract
approximately (not exactly) repeated patterns as theme melodies. In general,
however, theme melodies can be repeated more than once with some variations
within a song. Hence, an extraction mechanism that can deal with variations of
a theme melody is required for effective content-based music retrieval.

In our previous research ([6]), we proposed an algorithm for computing the
similarity between music objects based on their contents, using not only the
pitches but also the durations (lengths) of notes. In another previous study
([7]), we showed how the theme melodies can be used as the indexing terms for
efficient retrieval. However, we did not fully describe how the approximately
repeated theme melodies are extracted from a song.

In this paper, we propose a theme melody extraction mechanism in which a
modified version of the graphical clustering algorithm of [12] is used for
grouping the approximately repeated motifs into a cluster. In the first step of
the extraction mechanism, we split a music file into a set of motifs, each of
which is the minimum meaningful unit. Then the system extracts the pitches and
lengths of notes as the song's primitive features. Next, the system calculates
the similarity values between all pairs of motifs within the song based on the
pitches and lengths of the corresponding notes of the two motifs, and clusters
the motifs into groups based on the similarity values, so that the motifs
within a cluster are more similar to each other than to motifs in other
clusters. From each cluster, the system selects a theme melody based on the
positions of the motifs within the song and the maximum summation of the
similarity values of the related motifs in each cluster.

The rest of this paper is organized as follows. In Sect. 2, we briefly describe
the fundamental features of music and the general definition of a theme melody.
We also discuss previous related work on the extraction of theme melodies from
a song. In Sect. 3, we introduce a theme melody extraction mechanism that uses
the modified graphical clustering algorithm. We also describe how the theme
melodies that are approximately repeated within a song can be effectively
extracted for content-based music information retrieval. Section 4 concludes
this paper with a description of future work.







2 Related Works 

2.1 The Basic Concept of Theme Melody 

Music listeners get the semantics of music from the sequence of notes. A note,
the basic building block of music, has four characteristics: pitch, intensity,
duration, and timbre. A piece of music composed of consecutive notes has three
characteristics: rhythm, melody, and harmony. Among these characteristics, we
adopt the melody as the primitive feature of music for supporting content-based
music information retrieval because the melody can be a decisive component for
representing music ([6,7,8]).

As described above, music has patterns of notes, i.e., melodies, as an
important component. A single note cannot be a pattern. As notes are added, the
duration relationship between consecutive notes is established, and finally a
melody becomes clear. The pattern generally has a hierarchic structure that
allows listeners to understand the music ([9]).

The motif is the minimum pattern that has some meaning by itself within a song.
The theme is one of the important motifs and has the characteristic of 'theme
reinstatement', which means that the theme must be repeated more than once,
with some variations, within a song ([8,9]). The degree of variation of the
theme melodies depends on the composer and the song. In general, music
listeners remember the theme melodies of a song as the main meaning of the
song.



2.2 Previous Works on Theme Extraction from Music 

To improve the performance of content-based music retrieval systems, index
mechanisms that contain the representative parts of music, i.e., theme
melodies, as the indexing terms have been proposed. To that end, several
studies ([3,11]) have proposed mechanisms for extracting theme melodies from
music objects. In this section, we discuss the mechanism proposed in [3], since
it is considered an advanced mechanism.

In this mechanism, a melody is represented as the string of absolute pitches of
the notes in the melody. For a sub-string Y of a music feature string X, if Y
appears more than once in X, we call Y a repeating pattern of X. The frequency
of the repeating pattern Y, denoted as freq(Y), is the number of appearances of
Y in X.

Consider the melody string “do-re-mi-fa-do-re-mi-do-re-mi-fa”. This melody
string has ten repeating patterns. However, the repeating patterns “re-mi-fa”,
“mi-fa”, and “fa” are sub-strings of the repeating pattern “do-re-mi-fa”, and
freq(“do-re-mi-fa”) = freq(“re-mi-fa”) = freq(“mi-fa”) = freq(“fa”) = 2.
Similarly, the repeating patterns “do-re”, “re-mi”, “do”, “re”, and “mi” are
sub-strings of the repeating pattern “do-re-mi”, and freq(“do-re-mi”) =
freq(“do-re”) = freq(“re-mi”) = freq(“do”) = freq(“re”) = freq(“mi”) = 3. Among
the ten repeating patterns of the melody string, only “do-re-mi-fa” and
“do-re-mi” are considered non-trivial. The non-trivial patterns in a song are
considered the theme melodies of the song.
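The example can be reproduced with a brute-force Python sketch. It assumes that a repeating pattern is non-trivial when it is not a sub-string of another repeating pattern with the same frequency, which is consistent with the example but is not spelled out in this section; the function names are illustrative, and the enumeration of all sub-strings is not the algorithm of [3].

```python
from collections import Counter

def repeating_patterns(notes):
    """Map every sub-string (as a tuple) that occurs more than once to its frequency."""
    counts = Counter(
        tuple(notes[i:j])
        for i in range(len(notes))
        for j in range(i + 1, len(notes) + 1)
    )
    return {pattern: freq for pattern, freq in counts.items() if freq > 1}

def is_substring(short, long):
    """True if the tuple `short` occurs contiguously inside the tuple `long`."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))

def non_trivial(patterns):
    """Keep patterns not contained in another pattern of equal frequency (assumed rule)."""
    return {
        p: f for p, f in patterns.items()
        if not any(q != p and patterns[q] == f and is_substring(p, q) for q in patterns)
    }

melody = "do-re-mi-fa-do-re-mi-do-re-mi-fa".split("-")
patterns = repeating_patterns(melody)   # ten repeating patterns
themes = non_trivial(patterns)          # do-re-mi-fa (freq 2) and do-re-mi (freq 3)
```

Because only exact occurrences are counted, a variation of a single note would already remove a pattern from the result.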

With this mechanism, the longest exactly repeated patterns of a music object
are extracted as the theme melodies of the music. In other words, the previous
mechanism does not consider approximately (not exactly) repeating patterns when
extracting theme melodies. In general, however, a theme melody can be
approximately repeated more than once with some variations in the pitches and
lengths of the notes in the melody. Therefore, for effective content-based
music retrieval, we need a theme melody extraction mechanism that can deal with
variations of a theme melody when extracting theme melodies from a music
object.



3 An Extraction Mechanism of Theme Melodies from Music 

3.1 The Procedure of Theme Melody Extraction Mechanism 

The procedure of the theme melody extraction mechanism proposed in this paper
is shown in Fig. 1. When a music file is submitted, it is decomposed into a set
of motifs, each of which is the minimum meaningful unit. From the motifs, the
pitches and lengths of notes are extracted. Then the mechanism computes the
similarity values between all pairs of motifs of the song by using the
similarity computation equation proposed in [6]. Next, the similarity matrix
(or the similarity graph) is constructed. By using the proposed graphical
clustering algorithm described in Sect. 3.2, the motifs of the song are
clustered based on the pairwise similarity values in the similarity matrix.
Finally, a representative melody is selected from each cluster based on the
locations of the melodies in the song and the maximum summation of the
similarity values related to the melody in the cluster.







Fig. 1. The Procedure of Theme Melody Extraction Mechanism (flowchart from the
input, "Submit a music file", to the output, "Set of the theme melodies of the
input music file")



As an example, we consider the Korean children's song shown in Fig. 2. The song
is divided into 8 motifs. Each motif is labeled with a circled number in Fig. 2.




Fig. 2. The Score of a Korean Children Song “Nabi Nabi Hin-Nabi” 



The similarity values between all pairs of motifs are computed, and the
resulting similarity matrix is shown in Table 1. The entry (i, j) in the i-th
row and j-th column of the similarity matrix is the similarity value between
the i-th motif and the j-th motif of the song. From the score in Fig. 2, we can
see that the first, third, and seventh motifs are exactly the same, so the
entries (1, 3), (1, 7), and (3, 7) have the similarity value 100. Also, since
the similarity matrix is symmetric, the entries (3, 1), (7, 1), and (7, 3) are
100 too.



Table 1. The Similarity Matrix for “Nabi Nabi Hin-Nabi”

Motifs     1     2     3     4     5     6     7     8
  1      100    85   100    82    86    62   100    82
  2       85   100    85    96    81    73    85    96
  3      100    85   100    82    86    62   100    82
  4       82    96    82   100    77    70    82   100
  5       86    81    86    77   100    68    86    77
  6       62    73    62    70    68   100    62    70
  7      100    85   100    82    86    62   100    82
  8       82    96    82   100    77    70    82   100


3.2 A Graphical Clustering Algorithm 

To form fragments in database design, [12] proposed a graphical clustering
algorithm that subdivides the attributes of a relation into groups of
attributes based on the attribute affinity matrix, which is generated from the
attribute co-relationships.

The theme melody extraction mechanism needs a clustering algorithm that
clusters motifs into groups based on the similarity matrix, much as
fragmentation does in database design. Hence, with a slight modification, the
graphical clustering algorithm proposed in [12] can be applied to the theme
melody extraction problem.

To describe the modified graphical clustering algorithm in detail, we first
introduce the notations and terminology used in [12].

• Capital letters or numbers denote nodes, i.e., motifs of a song.

• Lowercase letters or two parenthesized node numbers such as (1-2) denote
edges between two nodes.

• P(e) denotes the similarity value of edge e between two motifs.

• Primitive cycle denotes any cycle in the similarity graph.

• Cluster cycle denotes a primitive cycle that contains a cycle node. In this
paper, we assume that a cycle stands for a cluster cycle unless otherwise
stated.

• Cycle completing edge denotes an edge that would complete a cycle.

• Cycle node is the node of the cycle completing edge that was selected
earlier.

• Former edge denotes an edge that was selected between the last cut and the
cycle node.

• Cycle edge is any of the edges forming a cycle.

• Extension of a cycle refers to a cycle being extended by pivoting at the
cycle node.

Based on the above definitions, we discuss the mechanism of forming cycles from
the similarity graph. As an example, in Fig. 3, suppose edges a and b have
already been selected and c is selected next. At this time, since the selection
of c forms a primitive cycle (a, b, c), we have to check the possibility of a
cycle, i.e., whether it is a cluster cycle. The possibility of a cycle results
from the condition that no former edge exists, or P(former edge) < P(all the
cycle edges). The primitive cycle (a, b, c) is a cluster cycle because it has
no former edge. Therefore, the cluster cycle (a, b, c) is marked as a candidate
cluster and node A becomes a cycle node.
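The "possibility of a cycle" test can be written as a small predicate. This is a minimal Python sketch; it reads "P(former edge) < P(all the cycle edges)" as "smaller than the weakest cycle edge", and the function and parameter names are hypothetical.

```python
def is_cluster_cycle(cycle_edge_values, former_edge_value=None):
    """Possibility of a cycle: no former edge, or the former edge is weaker
    than every edge of the primitive cycle."""
    if former_edge_value is None:
        return True
    return former_edge_value < min(cycle_edge_values)
```

For instance, a primitive cycle whose three edges all have the value 100 and which has no former edge passes the test, whereas a former edge that is stronger than some cycle edge makes it fail.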




Fig. 3. Cycle Detection and Its Extension([12]) 

Let us explain how the extension of a cycle is performed. In Fig. 3, after A is
determined to be a cycle node, suppose edge d is selected. Then d should be
checked for the possibility of extension of the cycle, i.e., whether it is a
potential edge for the extension of the cycle (a, b, c). The possibility of
extension results from the condition P(edge being considered or cycle
completing edge) > P(any one of the cycle edges). Thus, the old cycle (a, b, c)
can be extended to the new one (a, b, d, f) if the edge d under consideration
or the cycle completing edge f satisfies the possibility of extension: P(d) or
P(f) > minimum of (P(a), P(b), P(c)). Now the process continues: suppose edge e
is selected next. We know from the definition of the extension of a cycle that
e cannot be considered as a potential extension because the primitive cycle
(d, b, e) does not include the cycle node A. Hence it is discarded and the
process continues.

We now explain the relationship between a cluster cycle and a partition. There
are two cases in partitioning. In both cases, after partitioning, the threshold
is updated to the cutting value when this cutting value is greater than the
saved threshold.







1. Creating a partition with a new edge. If a new edge (e.g., d in Fig. 3) by
itself does not satisfy the possibility of extension, we continue to check an
additional new edge called the cycle completing edge (e.g., f in Fig. 3) for
the possibility of extension. In Fig. 3, the new edges d and f would
potentially provide such a possibility of extension of the earlier cycle
(a, b, c). If edges d and f meet the condition for the possibility of extension
above (i.e., P(d) or P(f) > minimum of (P(a), P(b), P(c))), then the extended
new cycle contains edges a, b, d, f. If the condition is not satisfied, we
produce a cut on edge d (called a cut edge), isolating the cycle (a, b, c), and
this cycle is considered a candidate cluster.

2. Creating a partition with a former edge. After cutting in the previous case,
if there is a former edge, then change the previous cycle node to the node
where the cut edge is incident, and check for the possibility of extension of
the cycle by the former edge. For example, in Fig. 4, suppose that a, b, and c
form a cycle (a, b, c) with cycle node A, that there is a cut on d, and that
the former edge w exists. Then the cycle node is changed from A to C because
the cut edge d originates from C. We now evaluate the possibility of extension
of the cycle (a, b, c) into one that contains the former edge w, i.e., the
cycle (a, b, e, w). If neither w nor e satisfies the possibility of extension,
i.e., P(w) and P(e) < minimum of (P(a), P(b), P(c)), then the result is the
following: w is declared a cut edge, node C remains the cycle node, and edges
a, b, c become a partition. Otherwise, if the possibility of extension is
satisfied, the result is the following: cycle (a, b, c) is extended to cycle
(e, w, a, b), node C remains the cycle node, and no partition can yet be
formed.
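The decision taken in the first case, together with the threshold update, can be sketched as follows. This is an illustrative Python reading of the rules above, not the authors' implementation; it assumes the cutting value is the similarity value of the cut edge, and all names are hypothetical.

```python
def cut_or_extend(cycle_edge_values, new_edge_value, completing_edge_value, threshold):
    """Return (action, threshold) for a new edge checked against a cluster cycle.

    The cycle is extended if the new edge or the cycle completing edge is
    stronger than the weakest cycle edge; otherwise the new edge is cut and
    the threshold is raised to the cutting value if that value is larger.
    """
    weakest = min(cycle_edge_values)
    if new_edge_value > weakest or completing_edge_value > weakest:
        return "extend", threshold
    return "cut", max(threshold, new_edge_value)
```

With a cycle whose edges all have the value 100, a new edge and completing edge of value 86, and an initial threshold of 0, the sketch returns a cut and a threshold of 86.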




Fig. 4. Partition ([12]) 

In post-processing, the meaningless clustering edges are eliminated from the
candidate clusters by using the threshold value that is continuously updated
during the partitioning described above. Here, a meaningless clustering edge is
an edge whose similarity value is smaller than the final threshold. After the
post-processing, a cluster in which no meaningful edge remains is removed from
the set of clusters.
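The post-processing step amounts to a filter over the candidate clusters. The Python sketch below assumes each candidate cluster is stored as a dictionary from an edge (a pair of motif numbers) to its similarity value; this representation, the function name, and the sample clusters are illustrative assumptions.

```python
def post_process(candidate_clusters, final_threshold):
    """Drop edges below the final threshold, then drop clusters left without edges."""
    pruned = []
    for cluster in candidate_clusters:
        kept = {edge: value for edge, value in cluster.items()
                if value >= final_threshold}
        if kept:
            pruned.append(kept)
    return pruned

# Hypothetical candidate clusters for illustration.
candidates = [
    {(1, 3): 100, (3, 7): 100, (1, 7): 100},
    {(5, 6): 68},
]
```

Calling post_process(candidates, 86) keeps the first cluster unchanged and removes the second one, since its only edge falls below the threshold.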

After post-processing, the theme extraction mechanism selects a motif from each
cluster as the representative of the cluster based on the locations of the
motifs and the similarity values in the cluster. In general, listeners are
likely to remember the first motif of a song as a theme melody of the song.
Hence, in our mechanism, from the cluster that includes the first motif, we
select the first motif as the representative of the cluster. From each cluster
that does not include the first motif, we select as the representative the
motif with the maximum summation of the similarity values of all edges adjacent
to the motif node.
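The selection rule can be sketched in Python as follows. The sketch assumes that the summation runs over the edges inside the cluster and that ties are broken in favour of the lowest motif number; neither detail is fixed by the text. The second sample cluster is hypothetical, although its edge values are taken from Table 1.

```python
def select_representative(cluster_edges):
    """Pick the representative motif of a cluster.

    cluster_edges maps an edge (i, j) of motif numbers to its similarity value.
    """
    nodes = sorted({node for edge in cluster_edges for node in edge})
    if 1 in nodes:
        return 1  # the first motif of the song is preferred
    totals = {
        node: sum(value for edge, value in cluster_edges.items() if node in edge)
        for node in nodes
    }
    # max() returns the first maximal element, i.e. the lowest motif number on ties
    return max(nodes, key=lambda node: totals[node])

first_cluster = {(1, 3): 100, (3, 7): 100, (1, 7): 100}
other_cluster = {(2, 4): 96, (2, 8): 96, (4, 8): 100}
```

For other_cluster the sums are 192 for motif 2 and 196 for motifs 4 and 8, so the sketch returns motif 4 under the assumed tie-breaking rule.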

The modified graphical clustering algorithm that can be used for clustering motifs 
based on the similarity matrix is in Fig. 5. 



3.3 Experimentation of the Theme Melody Extraction Mechanism 

As an example showing the effectiveness of the theme melody extraction
mechanism proposed in this paper, let us consider the similarity matrix of
Table 1 for the Korean children's song of Fig. 2. Since the similarity matrix
of Table 1 can be regarded as the fully connected similarity graph shown in
Fig. 6a, in the following example we use the similarity graph of Fig. 6a
instead of the similarity matrix of Table 1. Due to space restrictions, we do
not present all similarity values in the similarity graphs; when the similarity
value of an edge is needed, we present it as the label on the edge.

The graphical clustering algorithm starts from node 1. Among the edges adjacent
to node 1, the edge (1-3) between nodes 1 and 3 is selected since it has the
maximum similarity value. Selecting edge (1-3) satisfies neither the condition
of step 3 (it does not form a primitive cycle) nor the condition of step 4
(there is no candidate partition), so step 2 of Algorithm 1 selects edge (3-7)
as the next edge for the next iteration. Since edge (3-7) again satisfies
neither the condition of step 3 nor that of step 4, edge (7-1) is selected as
the next edge based on the conditions described in step 2. Selecting edge (7-1)
forms a primitive cycle (1-3, 3-7, 7-1) (see Fig. 6b), i.e., it satisfies the
condition of step 3 of Algorithm 1, so step 3 is executed. Since there is no
cycle node, the "possibility of a cycle" described in Sect. 3.2 should be
checked. Since there is no former edge, the condition of the possibility of a
cycle holds, and the primitive cycle is considered a cluster cycle and becomes
a candidate partition. So, node 1 becomes the cycle node of the primitive cycle
(1-3, 3-7, 7-1). Then, edge (7-5) is selected as the next edge to consider for
the next iteration since it is adjacent to node 7 and, at the same time, has
the maximum similarity value among the edges adjacent to node 7. Since the
selection of edge (7-5) satisfies the condition of step 4, i.e., edge (7-5)
does not form a primitive cycle and a candidate cluster (1-3, 3-7, 7-1) exists,
and there is no former edge, the possibility of extension of the primitive
cycle, i.e., whether edge (7-5) can enlarge the primitive cycle, should be
checked. Since edge (7-5) cannot enlarge the primitive cycle, edge (7-5) is
cut, the primitive cycle becomes a candidate cluster, and the threshold is set
to 86 (note that this cutting value is larger than the initial threshold 0).
See Fig. 6d for the result of forming the first candidate cluster,
(1-3, 3-7, 7-1), with the new threshold 86.



Algorithm 1: The Modified Graphical Clustering Algorithm

Input: Similarity matrix of size n × n, or the fully connected similarity graph of n nodes

Output: Set of clusters, each containing similar motifs

Step 1. Start from the first node (row 1) with threshold = 0.

Step 2. Select an edge that satisfies the following conditions. When all nodes have been used, this iteration ends; go to step 5.

(1) If a cutting occurred in the previous iteration, start from the next node adjacent to the cut edge.

(2) Otherwise, the edge should have the largest value among the candidate edges that are linearly connected to the tree already constructed.

Step 3. When the selected edge forms a primitive cycle:

(1) If a cycle node does not exist, check for the "possibility of a cycle"; if the possibility exists, mark the cycle as a cluster cycle with the cycle node and consider this cycle as a candidate partition. Go to step 2.

(2) If a cycle node already exists, discard this edge and go to step 2.

Step 4. When the selected edge does not form a cycle and a candidate partition exists:

(1) If no former edge exists, check for the possibility of extension of the cycle by this new edge. If there is no possibility, cut this edge and consider the cycle as a partition. The cutting value becomes the new threshold when it is greater than the saved threshold. Go to step 2.

(2) If a former edge exists, change the cycle node and check for the possibility of extension of the cycle by the former edge. If there is no possibility, cut the former edge and consider the cycle as a partition. The cutting value becomes the new threshold when it is greater than the saved threshold. Go to step 2.

Step 5. Remove from the set of partitions the meaningless edges whose similarity values are smaller than the final threshold.



Fig. 5. The Modified Graphical Clustering Algorithm 









According to step 2 of Algorithm 1, edge (5-2) is selected for the next iteration in Fig. 6e, since edge (5-2) has the maximum similarity value among the edges adjacent to the cut node 5. Then edges (2-4) and (4-8) are selected in order, since selecting these edges satisfies neither the condition of step 3 nor that of step 4. However, when edge (8-2) is selected as the next edge by step 2, as shown in Fig. 6f, the condition of step 3 is satisfied, i.e., edge (8-2) forms a primitive cycle (2-4, 4-8, 8-2). Since there is no cycle node yet, the possibility of a cycle is checked. There is a former edge (5-2) of the primitive cycle, and P(former edge) < P(all the cycle edges) holds, i.e., P(edge(5-2)) is less than the minimum of P(edge(2-4)), P(edge(4-8)), and P(edge(8-2)). Hence the primitive cycle (2-4, 4-8, 8-2) becomes a cluster cycle and node 2 becomes the cycle node of the cluster cycle (see Fig. 6g). Next, edge (8-5) is selected. Edge (8-5) does not form a cycle and there is a candidate partition, so the condition of step 4 is satisfied; since there is a former edge (5-2), the cycle node is changed from node 2 to node 8. The possibility of extension of the cycle by the former edge must then be checked, i.e., whether the former edge has a larger similarity value than the cycle edges. In Fig. 6h, since the similarity value 81 of the former edge (5-2) is not larger than the similarity values 96, 100, and 96 of the cycle edges (2-4), (4-8), and (8-2), respectively, edge (5-2) becomes a cut edge and the cluster cycle (2-4, 4-8, 8-2) becomes a new candidate partition. However, since the similarity value 81 of the cut edge is not greater than the saved threshold, the threshold is not updated, i.e., it remains 86. Fig. 6i shows the second candidate partition formed.
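The two numeric checks used in this part of the example can be sketched in Python. This is only an illustration under stated assumptions: the function names `cycle_is_possible` and `updated_threshold` are hypothetical, and the cycle condition is the one quoted in the example (the complete definition is in Sect. 3.2), so the sketch is not a full implementation of Algorithm 1.

```python
def cycle_is_possible(former_edge_value, cycle_edge_values):
    """Return True when a primitive cycle may become a cluster cycle.

    With no former edge the cycle is accepted; otherwise the former edge
    must be strictly less similar than every edge of the cycle.
    """
    if former_edge_value is None:
        return True
    return former_edge_value < min(cycle_edge_values)


def updated_threshold(saved_threshold, cutting_value):
    """A cutting value replaces the threshold only when it is larger."""
    return max(saved_threshold, cutting_value)


# Values taken from the example: cycle (2-4, 4-8, 8-2) and former edge (5-2).
cycle_values = [96, 100, 96]
second_cycle_ok = cycle_is_possible(81, cycle_values)

threshold = updated_threshold(0, 86)          # first cut, edge (7-5)
threshold = updated_threshold(threshold, 81)  # second cut, edge (5-2)
print(second_cycle_ok, threshold)
```

With the values of the example, the second cycle is accepted and the threshold stays at 86 after the second cut.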

The remaining nodes 5 and 6 form a third candidate partition, as shown in Fig. 6j. Hence, from the similarity graph of the Korean children's song in Fig. 2, we finally obtain three candidate partitions, as shown in Fig. 6k, and the final threshold is 86.

In the post-processing described in step 5 of Algorithm 1, edge (5-6) is removed from the third candidate partition, since its similarity value is smaller than the final threshold. The third candidate partition (5, 6) is thus eliminated. As the final result, two candidate partitions, (1, 3, 7) and (2, 4, 8), are returned by Algorithm 1.
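The post-processing of step 5 can be illustrated with a short Python sketch. The function name `prune_partitions` and the data layout are assumptions made for this illustration. The similarity values of the second partition are those given in the example; the values of the first partition and of edge (5-6) are not listed in the text, so hypothetical values are used for them.

```python
def prune_partitions(partitions, threshold):
    """Drop edges below the threshold; drop partitions left without edges."""
    clusters = []
    for edges in partitions:
        kept = [edge for edge, value in edges.items() if value >= threshold]
        if kept:
            clusters.append(sorted({node for edge in kept for node in edge}))
    return clusters


partitions = [
    {(1, 3): 100, (3, 7): 96, (7, 1): 96},  # hypothetical values
    {(2, 4): 96, (4, 8): 100, (8, 2): 96},  # values from the example
    {(5, 6): 70},                           # hypothetical value below 86
]
final_clusters = prune_partitions(partitions, threshold=86)
print(final_clusters)
```

Under these assumed values, only the clusters (1, 3, 7) and (2, 4, 8) survive, as in the example.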

From the final result of Algorithm 1, nodes 1 and 8 are selected as the representative melodies of the first candidate cluster (1, 3, 7) and the second candidate cluster (2, 4, 8), respectively. Node 1 is the first motif of the song, and the first motif generally makes a stronger impression on listeners than the others; node 8 has a greater sum of the similarities of its adjacent edges than the other nodes. Hence, the first motif and the last motif are selected as the theme melodies of the Korean children's song. From the score of the song in Fig. 2, we can see that the first and last motifs are repeated several times, exactly or approximately, which gives listeners the impression that they are the main melodies of the song.








Fig. 6. Forming Candidate Clusters from a Similarity Graph 






Fig. 6. Forming Candidate Clusters from a Similarity Graph (continued), panels (g) to (j)



4 Conclusions 



In this paper, we proposed a theme melody extraction mechanism in which a modified graphical clustering algorithm is used to cluster the motifs of a song into sets of similar melodies. In the first step of the extraction mechanism, we split a music file into a set of motifs, each of which is the minimum meaningful unit. The system then extracts pitch and duration information from the motifs as primitive features. Next, the system clusters the motifs into groups based on the similarity values calculated between all pairs of motifs from the pitches and durations of their notes, such that motifs within a cluster have higher similarity values to each other than to motifs in other clusters. To group motifs into clusters of similar motifs, we modified the graphical clustering algorithm proposed in [12] for database design. From the clusters, the system selects a theme melody based on the locations of the motifs within the song and the maximum sum of the similarity values of the motifs in each cluster. As an experimental result, we presented an example that describes how the theme melodies of a song can be extracted by using the proposed algorithm.






Acknowledgements. This work was supported by grant No. 2000-1-51200-009-2 
from the Basic Research Program of the Korea Science & Engineering Foundation. 



References 



1. A. Ghias, J. Logan, D. Chamberlin and B.C. Smith, "Query By Humming: Musical Information Retrieval in an Audio Database", ACM Multimedia, 1995.

2. R. J. McNab, L.A. Smith, I. H. Witten, C. L. Henderson and S.J. Cunningham, "Towards the Digital Music Library: Tune Retrieval from Acoustic Input", Digital Libraries, 1996.

3. Chih-Chin Liu, Jia-Lien Hsu, and Arbee L. P. Chen, "Efficient Theme and Non-trivial Repeating Pattern Discovering in Music Databases", The Proceedings of the 15th International Conference on Data Engineering, 1999.

4. Ta-Chun Chou, Arbee L. P. Chen, and Chih-Chin Liu, "Music Databases: Indexing Techniques and Implementation", The Proceedings of IEEE International Workshop on Multimedia Database Management Systems, 1996.

5. Jia-Lien Hsu, Chih-Chin Liu, and Arbee L. P. Chen, "Efficient Repeating Pattern Finding in Music Databases", The Proc

6. Jong-Sik Mo, Chang-Ho Han, and Yoo-Sung Kim, "A Similarity Computation Algorithm for MIDI Musical Information", The Proceedings of 1999 IEEE Knowledge and Data Engineering Exchange Workshop, Chicago, USA, 1999.

7. So-Young Kim and Yoo-Sung Kim, "An Indexing and Retrieval Mechanism Using Representative Melodies for Music Databases", The Proceedings of 2000 International Conference on Information Society in the 21st Century, 2000.

8. Byeong-Wook Lee and Gi-Poong Park, "Everybody Can Compose Songs", Jackeunwoori Pub. Co., 1989.

9. Leonard B. Meyer, Explaining Music: Essays and Explorations, Saeguang Pub. Co., 1990.

10. S. Wu and U. Manber, "Fast Text Searching Allowing Errors", Communications of the ACM, Vol. 35, No. 10, 1992.

11. Arbee L. P. Chen, Jia-Lien Hsu, and C. C. Liu, "Efficient Repeating Pattern Finding in Music Databases", The Proceedings of ACM Conference on Information and Knowledge Management, 1998.

12. Shamkant B. Navathe and Minyoung Ra, "Vertical Partitioning for Database Design: A Graphical Algorithm", The Proceedings of ACM SIGMOD, 1989.




A Multilingual Information System Based on 
Knowledge Representation 



Catherine Roussey, Sylvie Calabretto, and Jean-Marie Pinon 
LISI, Bat 501, INSA de Lyon, 20 avenue Albert Einstein, F 69621 VILLEURBANNE Cedex

{Catherine.roussey, sylvie.calabretto, jean-marie.pinon}@lisi.insa-lyon.fr



Abstract. This paper presents a new model for multilingual document indexing and information retrieval used in a documentary information system. A multilingual system exploits a multilingual document collection whose documents are written in different languages, though each individual document may contain text in only one language. We have developed a multilingual information retrieval system based on a knowledge representation model. To implement our information retrieval system, we chose an expressive formalism containing relation properties: the Sowa Conceptual Graph (CG). In this article, we define a new model of conceptual graph in order to enhance the effectiveness of our matching retrieval function, and we present the architecture of our multilingual information system, which manages XML documents. Our approach has been applied to a collection of mechanical documents in a scientific library.



1 Introduction 

This paper deals with a new approach to document indexing and retrieval in a multilingual information system. A documentary information system is a system that manages and exploits a document collection. By managing, we mainly mean all the operations linked to the storage and retrieval of documents by a user. More precisely, a documentary information system performs different functions: document storage, document indexing and retrieval, document visualization (of an entire document, of its abstract, etc.), document editing, navigation in the document collection, and finally the management of versions and variants. A "multilingual information system" is a documentary information system exploiting a multilingual document collection whose documents are written in different languages, though each individual document may contain text in only one language. The paper focuses on the indexing and information retrieval functions for a multilingual document collection. We present a multilingual information retrieval system based on a knowledge representation model. Following recent works, the semantics of the index must be enhanced, so the usual list of keywords is transformed into a more complex indexing structure in which relations link keywords. Adding semantics to the index requires an adapted matching function. The logical Information Retrieval (IR) model seems to be the only model suitable for managing complex semantic structures. Based on this model, several systems have been developed using different formalisms and different interpretations. In order to carry out the implementation



A. Caplinskas and J. Eder (Eds.): ADBIS 2001, LNCS 2151, pp. 98-111, 2001.
© Springer-Verlag Berlin Heidelberg 2001 







of our logical IR system, we chose an expressive formalism containing relation properties: the Sowa Conceptual Graph (CG) [6]. Different systems have used this formalism to implement a logical IR system. However, these systems produce a lot of silence, which decreases the recall rate. In this article, we define a new model of conceptual graph in order to enhance the effectiveness of our matching retrieval function.

First, we present the Conceptual Graph formalism, followed by a recent state of the art on logical IR systems based on CGs. Afterwards, we introduce our model and the corresponding algorithms. Finally, we present our prototype and its evaluation.



2 Conceptual Graph Formalism 

In 1986, van Rijsbergen [4] modeled the relevance of a document to a query by a logical uncertainty principle. That is, given two logical formulas, d, the document representation, and q, the query representation, a matching function between d and q measures the uncertainty of d → q relative to a given data set Ks. Indeed, this function determines the minimal transformation of the data set Ks necessary to establish the truth of the implication d → q. This approach improves the use of knowledge in IR systems because document descriptors can be more than terms. This model thus proposes an optimized matching function between document and query that takes into account the semantics associated with document descriptors.

Different interpretations of this logical principle are possible because the data set Ks is not precisely defined [2]:

1. Ks can be general domain knowledge related to d and q. We should then evaluate the minimal transformation of Ks needed to obtain the implication d → q.

2. Ks can be fixed and d can be modified. We should then evaluate the minimal transformation of d into d' needed to obtain the implication d' → q.

3. q can be changed. We should then evaluate the minimal transformation of q into q' needed to obtain the implication d → q'.

In fact, van Rijsbergen proposes a general model in which any logic can be applied. Thus, any formalism can be chosen, provided it has a logical implication. Some recent works [1,3] use the Sowa Conceptual Graph formalism to implement a logical IR system.

A conceptual graph is an oriented graph composed of concept nodes, conceptual relation nodes (or relation nodes), and edges between concept and relation nodes. A concept is labeled by a type and possibly a marker. The type corresponds to a semantic class and the marker is a particular instance of a semantic class. For instance, [MAN: *] stands for the concept of all possible men. This concept is called a generic concept and is also noted [MAN]. On the other hand, [MAN: Bill Clinton] stands for the concept of a man named Bill Clinton. "*" and "Bill Clinton" are two examples of markers. A relation node is labeled only by a type. A specialization relation, noted ≤, classifies concept types and relation types by linking a generic type to a more specific one. A relation type has a fixed number of arguments, called its arity: the number of concepts linked by this relation type. Each relation type has a signature that defines the most generic concept type usable as each argument of the relation.







Specialization relations are useful for comparing graphs by means of the Sowa projection operator. This operator defines a specialization relation between graphs. As shown in Fig. 1, there is a projection of a graph H in a graph G if there exists in G a copy of the graph H in which all nodes are specializations of the corresponding nodes of H.




Fig. 1. A projection example

Moreover, Sowa proposes a translation mechanism (the Φ operator) from CGs into first-order logical formulas. Each graph g is associated with a logical formula Φ(g). Sowa has proven that there is a relation between the existence of a projection between two graphs and the implication of their logical formulas. There is a projection of a graph q in a graph d if the formula Φ(d) associated with d implies the formula Φ(q) associated with q: Φ(d) → Φ(q). Therefore, the conceptual graph formalism is used to implement logical operational systems in which the matching function is based on the projection operator. For example, we could consider the previous graph H as the representation of a query and the graph G as the index of a document. The existence of a projection of H in G is equivalent to Φ(G) → Φ(H), so the document is relevant to the query.



3 Related Works 



Different logical Information Retrieval systems based on the Conceptual Graph formalism have been developed, depending on the interpretation of the logical uncertainty principle.



Fig. 2. Graph examples for queries and documents (panels: graph d, graph q1, graph q2)



1. Ounis et al. [3] experimented with the first interpretation of van Rijsbergen's principle in the RELIEF system. The set of domain knowledge Ks is transformed by using relational properties such as symmetry, transitivity, inversion, etc. These relation properties are all captured in Ks in order to refine the indexes and thus improve retrieval effectiveness. Another contribution of this work is a fast matching function at execution time, even though, in graph theory, a projection cannot be performed in polynomial time. Thanks to an inverted file and acceleration tables, RELIEF performs some pre-processing at indexing time, so projections are performed faster during retrieval.

2. David Genest [1] has noted that one of the drawbacks of the projection operator is that it increases silence. One of the reasons is that a matching function based on projection gives boolean results: either there is a projection from a query to a document or there is not. Moreover, the documents judged relevant are only specializations of the query. For example, in Fig. 2 there is no projection from q1 or q2 into doc. However, if q1 and q2 represent queries and doc represents a document index, this document seems to be relevant to these queries. In order to take such problems into account, David Genest proposes an implementation of the second interpretation of the uncertainty principle. He defines some transformations on the Conceptual Graph used as document index in order to find a projection from the query graph to the index graph. These transformations include specialization or generalization of node labels, node joins, and node and edge additions. Moreover, a mechanism is proposed to order sequences of transformations. As a consequence, the matching function based on projection becomes a ranking function that orders the relevant documents for a query.

3. The third interpretation of the uncertainty principle implies transforming the query to establish the implication d → q. Since the projection operator implies expensive processing, it does not seem practical to perform it in real time during retrieval.

Our proposal takes the improvements of Ounis and Genest into account by proposing a graph matching function optimized for information retrieval needs. We now present our graph formalism, called semantic graph.



4 Semantic Graph Model 

An ontology defines the graph vocabulary. To simplify the formalism, the notion of markers is eliminated from the conceptual graph model because, in our case, document indexes contain only generic notions.

An ontology $O$ is a 3-tuple $O = (T_C, T_R, \sigma)$ where:

- $T_C$ is a set of concept types partially ordered by the specialization relation, noted $\leq$.

- $T_R$ is a set of binary relation types¹ partially ordered by $\leq$.

- $\sigma$, called signature, is a mapping which associates with any relation type the greatest concept type of each of its arguments. In other words, for any $t_r \in T_R$, the type of the $k$-th argument of $t_r$ should be more specific than the type of the $k$-th argument of $\sigma(t_r)$. The $i$-th argument of $\sigma(t_r)$ is noted $\sigma_i(t_r)$.



1 In general, a relation type can have any arity, but in this paper relations are considered to be binary only, like case relations or thematic roles associated with verbs [7].







We now present our model, called semantic graph. Compared with conceptual graphs, the emphasis is on the relations between concepts: every concept must be linked to at least one other concept to build an arch.

A semantic graph is a 4-tuple $G_s = (C, A, \mu, \nu)$ where:

- $C$ is a set of concept nodes contained in $G_s$.

- $A \subseteq C \times C$ is a set of arches contained in $G_s$. For each arch $a = (c, c') \in A$, the $i$-th concept node of $a$ (also called argument of $a$) is noted $a_i$ ($a_1 = c$ and $a_2 = c'$). The set of $i$-th concept nodes (arguments) of the arches of $A$ is noted $A_i$.

- $\mu: C \rightarrow T_C$ is a mapping which associates with each concept node $c \in C$ a label $\mu(c) \in T_C$; $\mu(c)$ is also called the type of $c$.

- $\nu: A \rightarrow T_R$ is a mapping which associates with each arch $a \in A$ a label $\nu(a) \in T_R$; $\nu(a)$ is also called the type of $a$.

A semantic graph satisfies the following constraints:

1. For each $a \in A$ with $a = (c, c')$ and $\nu(a) = r$: $\mu(a_i) \leq \sigma_i(r)$, so $\mu(c) \leq \sigma_1(r)$ and $\mu(c') \leq \sigma_2(r)$.

2. Every concept node belongs to at least one arch, so $C \subseteq A_1 \cup A_2$.
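The ontology and the two constraints of a semantic graph can be illustrated with a small Python sketch. It is only a sketch under stated assumptions: the class and field names are hypothetical, the type hierarchy is simplified to a tree (each type has at most one direct generalization), and the vocabulary ("Entity", "Vehicle", "Car", "Engine", "partOf") is invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Ontology:
    parent: dict     # type -> direct generalization (None for a root); tree assumed
    signature: dict  # relation type -> (greatest type of arg 1, greatest type of arg 2)

    def is_subtype(self, t, ancestor):
        """True when t equals ancestor or specializes it (t <= ancestor)."""
        while t is not None:
            if t == ancestor:
                return True
            t = self.parent.get(t)
        return False


@dataclass
class SemanticGraph:
    concept_type: dict  # mu: concept node -> concept type
    arch_type: dict     # nu: arch (c, c') -> relation type

    def check(self, ontology):
        """Verify the signature constraint and the 'no isolated node' constraint."""
        for (c1, c2), r in self.arch_type.items():
            s1, s2 = ontology.signature[r]
            if not ontology.is_subtype(self.concept_type[c1], s1):
                return False
            if not ontology.is_subtype(self.concept_type[c2], s2):
                return False
        linked = {c for arch in self.arch_type for c in arch}
        return set(self.concept_type) <= linked


onto = Ontology(
    parent={"Entity": None, "Vehicle": "Entity", "Car": "Vehicle", "Engine": "Entity"},
    signature={"partOf": ("Entity", "Vehicle")},
)
good = SemanticGraph({"c1": "Engine", "c2": "Car"}, {("c1", "c2"): "partOf"})
isolated = SemanticGraph(
    {"c1": "Engine", "c2": "Car", "c3": "Vehicle"}, {("c1", "c2"): "partOf"}
)
good_ok = good.check(onto)
isolated_ok = isolated.check(onto)
print(good_ok, isolated_ok)
```

The second graph is rejected because the node `c3` does not belong to any arch.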

The pseudo projection operator is an extension of the projection operator of Sowa Conceptual Graphs. A pseudo projection defines a morphism between graphs with fewer constraints than the projection operator. The existence of a pseudo projection of a graph H in a graph G means that a part of the information represented by G is close to the information represented by H.



4.1 Pseudo Projection Operator

A pseudo projection from a semantic graph $H = (C_H, A_H, \mu_H, \nu_H)$ to a semantic graph $G = (C_G, A_G, \mu_G, \nu_G)$ is a pair of mappings $\Pi = (f, g)$ such that $f: A_H \rightarrow A_G$ associates an arch of $H$ with an arch of $G$, and $g: C_H \rightarrow C_G$ associates a concept node of $H$ with a set of concept nodes of $G$. $\Pi$ has the following properties:

1. Concept nodes need not be preserved.

For any concept nodes $c, c' \in C_H$, it is possible that $g(c) = g(c')$.

Two concept nodes of $H$ can have the same image by $\Pi$ (for example, the same node of $G$).

For any concept node $c \in C_H$, it is possible that $g(c) = \{b, b'\}$ such that $b$ and $b' \in C_G$.

A concept node of $H$ can have several images by $\Pi$ (two distinct concept nodes of $G$).

2. The type of a concept node can be restricted.

For any concept node $c$ of $C_H$, if there exists $b \in C_G$ such that $b \in g(c)$, then $\mu_G(b) \leq \mu_H(c)$.

For any concept node belonging to the graph $H$, the type of its image by $\Pi$ is more specific than its type.

3. The type of a concept node can be increased.

For any concept node $c$ of $C_H$, if there exists $b \in C_G$ such that $b \in g(c)$, then $\mu_H(c) \leq \mu_G(b)$.







For any concept node belonging to the graph $H$, its type is more specific than the type of its image by $\Pi$.

4. Arches are preserved.

For any arch $a = (c, c')$ of $A_H$ there exists an arch $f(a) = (b, b')$ of $A_G$ such that $b \in g(c)$ and $b' \in g(c')$.

5. The type of an arch can be restricted.

For any arch $a$ of $A_H$, $\nu_G(f(a)) \leq \nu_H(a)$.

For any arch belonging to the graph $H$, the type of its image by $\Pi$ is more specific than its type.

6. The type of an arch can be increased.

For any arch $a$ of $A_H$, $\nu_H(a) \leq \nu_G(f(a))$.

For any arch belonging to the graph $H$, its type is more specific than the type of its image by $\Pi$.

Indeed, the pseudo projection operator makes it possible to check whether all the information represented by $H$ is close to some information represented by $G$. We now need a mechanism to check whether part of the information from $H$ is close to some information from $G$. At this point, the document doc of Fig. 2 answers the query q1, but doc is still not relevant to q2.



4.2 Partial Pseudo Projection Operator 

There is a partial pseudo projection from $H$ to $G$ if there is a subgraph $H'$ of $H$ such that there exists a pseudo projection from $H'$ to $G$.

Let us point out that the pseudo projection operator does not preserve the number of concept nodes, because we focus on the relations between concepts, so arches are more important than concepts. Moreover, we think that there is no need to preserve the whole structure of a graph for IR purposes, because a semantic graph is considered to be equivalent to its set of arches in which concepts are unique.

From the specialization relation, similarity functions between types will be defined. Consequently, the similarity between two arches can be evaluated. Then, an evaluation of the pseudo projection operator between graphs will enable us to define the similarity function between two semantic graphs.



4.3 Similarity Function between Types 

The specialization relation makes it possible to define a similarity function between types, noted sim. sim is an asymmetric function that returns a real value ranging from 0 to 1:

$$sim : T_C \times T_C \cup T_R \times T_R \rightarrow [0, 1]$$

sim is defined as follows:

- If two types are identical, then the similarity function returns the value 1.

- If a type $t_1$ specializes another type $t_2$ directly, i.e., there is no intermediate type between $t_1$ and $t_2$ in the type hierarchy, then the similarity function returns a constant value, fixed arbitrarily and lower than 1, such that $sim(t_2, t_1) = V_S$ and $sim(t_1, t_2) = V_G$.







- If a type $t_1$ specializes another type $t_2$ indirectly, i.e., there is an intermediate type $t$ between $t_1$ and $t_2$ in the type hierarchy, then the similarity between $t_1$ and $t_2$ is the product of the similarities between $(t_1, t)$ and $(t, t_2)$, that is to say: $sim(t_2, t_1) = sim(t_2, t) \times sim(t, t_1)$ and $sim(t_1, t_2) = sim(t_1, t) \times sim(t, t_2)$.
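The three rules can be sketched in Python as follows. The sketch rests on several assumptions that are not given in the text: the constants `V_S` and `V_G` are arbitrary illustrative values, $V_S$ is read as the similarity from a general type to a type that directly specializes it and $V_G$ as the similarity in the opposite direction, the hierarchy is a tree with an invented vocabulary, and unrelated types are given similarity 0.

```python
V_S = 0.75  # hypothetical value of sim(general type, specific type), one step
V_G = 0.5   # hypothetical value of sim(specific type, general type), one step

# Hypothetical hierarchy: each type maps to the type it directly specializes.
PARENT = {"Entity": None, "Vehicle": "Entity", "Car": "Vehicle", "Engine": "Entity"}


def steps_up(t, ancestor):
    """Number of direct specialization links from t up to ancestor, or None."""
    steps = 0
    while t is not None:
        if t == ancestor:
            return steps
        t = PARENT.get(t)
        steps += 1
    return None


def sim(t_a, t_b):
    """Similarity between two types: 1 if identical, a product of constants along the path."""
    down = steps_up(t_b, t_a)  # t_b specializes t_a in `down` steps
    if down is not None:
        return V_S ** down     # equals 1.0 when the types are identical
    up = steps_up(t_a, t_b)    # t_a specializes t_b in `up` steps
    if up is not None:
        return V_G ** up
    return 0.0                 # assumption: unrelated types are not similar


print(sim("Car", "Car"), sim("Vehicle", "Car"), sim("Car", "Vehicle"), sim("Entity", "Car"))
```

Because the rule for indirect specialization is a product over the intermediate types, the similarity across k direct links reduces to the k-th power of the corresponding constant.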



4.4 Similarity Function between Arches 

The similarity function between types makes it possible to define a similarity function between two arches, noted $Sim_A$. $Sim_A$ is a real-valued function that returns a value ranging from 0 to 1. It is defined as follows.

Let $a_H = (c_H, c'_H)$ be an arch with $\nu(a_H) = r_H$, and let $a_G = (c_G, c'_G)$ be an arch with $\nu(a_G) = r_G$. Then

$$Sim_A(a_H, a_G) = \frac{sim(\nu(a_H), \nu(a_G)) + \sum_{i=1}^{2} sim(\mu((a_H)_i), \mu((a_G)_i))}{3} \quad (1)$$

$$Sim_A(a_H, a_G) = \frac{sim(r_H, r_G) + sim(\mu(c_H), \mu(c_G)) + sim(\mu(c'_H), \mu(c'_G))}{3} \quad (2)$$

Indeed, $Sim_A$ computes the average of the similarities between corresponding arch components.
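As a small illustration, the average over the three arch components can be written in Python. The representation of an arch as a triple (relation type, type of the first argument, type of the second argument), the names `sim_arch` and `table_sim`, and the similarity values in `TABLE` are assumptions made for this sketch; `table_sim` merely stands in for the type similarity function of Sect. 4.3.

```python
def sim_arch(arch_h, arch_g, sim):
    """Average type similarity of the relation and of the two arguments.

    An arch is a triple (relation type, type of argument 1, type of argument 2).
    """
    return sum(sim(t_h, t_g) for t_h, t_g in zip(arch_h, arch_g)) / 3


# Hypothetical type similarities standing in for the function sim.
TABLE = {("Vehicle", "Car"): 0.75, ("Car", "Vehicle"): 0.5}


def table_sim(t_a, t_b):
    """Return 1 for identical types, a tabulated value otherwise (0 if absent)."""
    if t_a == t_b:
        return 1.0
    return TABLE.get((t_a, t_b), 0.0)


query_arch = ("partOf", "Engine", "Vehicle")
index_arch = ("partOf", "Engine", "Car")
score = sim_arch(query_arch, index_arch, table_sim)
print(score)
```

Here two components match exactly and the third contributes the tabulated value, so the score is the mean of 1, 1, and 0.75.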



4.5 Pseudo Projection Evaluation 

Given two semantic graphs, a pseudo projection from one graph to the other either exists or does not. Rather than a boolean result, a real value is preferred, so a computation is proposed to evaluate the resemblance between two graphs.

If there is a pseudo projection $\Pi$ from a graph $H = (C_H, A_H, \mu_H, \nu_H)$ to a graph $G = (C_G, A_G, \mu_G, \nu_G)$, then this pseudo projection is evaluated by a real function, noted val, defined as follows:

$$val(\Pi) = \frac{\sum_{a \in A_H} Sim_A(a, f(a))}{|A_H|} \quad (3)$$

val is the average of the similarity function between each arch of $H$ and its image in $G$ by $\Pi$.



4.6 Similarity Function between Graphs 

The similarity function, noted $sim_G$, between a graph $H = (C_H, A_H, \mu_H, \nu_H)$ and a graph $G = (C_G, A_G, \mu_G, \nu_G)$ is a real function returning values ranging from 0 to 1:

$$sim_G(H, G) = \max_{\Pi} val(\Pi) \quad (4)$$

or

$$sim_G(H, G) = \max_{\Pi} \frac{\sum_{a \in A_H} Sim_A(a, f(a))}{|A_H|} \quad (5)$$

where $\Pi$ is a partial pseudo projection from $H$ to $G$.
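The evaluation of a pseudo projection and the resulting graph similarity can be sketched in Python. The sketch relies on an assumption that should be checked against the model: since concept nodes need not be preserved, each query arch is treated as mapped independently, so the maximum over partial pseudo projections is obtained by picking the best image arch by arch, and an arch whose best similarity is 0 is treated as unmapped. The names `val`, `sim_graph`, `table_sim_arch` and the values in `WEIGHTS` are hypothetical.

```python
def val(mapping, sim_arch, n_query_arches):
    """Average arch similarity of one (partial) pseudo projection.

    mapping: dict from a query arch to its image arch in the document graph.
    """
    return sum(sim_arch(a, b) for a, b in mapping.items()) / n_query_arches


def sim_graph(query_arches, doc_arches, sim_arch):
    """Best value over arch-by-arch mappings from the query to the document graph."""
    if not query_arches or not doc_arches:
        return 0.0
    best = {}
    for a in query_arches:
        scored = [(sim_arch(a, b), b) for b in doc_arches]
        weight, image = max(scored, key=lambda pair: pair[0])
        if weight > 0:  # assumption: weight 0 means "no related arch"
            best[a] = image
    return val(best, sim_arch, len(query_arches))


WEIGHTS = {("q1", "d1"): 0.75, ("q1", "d2"): 0.5}  # hypothetical Sim_A values


def table_sim_arch(a, b):
    """Stand-in for Sim_A: tabulated value, 0 when the arches are unrelated."""
    return WEIGHTS.get((a, b), 0.0)


score = sim_graph(["q1", "q2"], ["d1", "d2"], table_sim_arch)
print(score)
```

In this example only the query arch `q1` has an image, and its contribution is divided by the total number of query arches, which penalizes the unmatched arch `q2`.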



5 Indexing and Information Retrieval Algorithms 

Having introduced our semantic graph model, we now concentrate on the organization of the indexes for faster retrieval. Our structure consists of an inverted file and several acceleration tables (see Fig. 3). The inverted file groups in the same entry all the documents indexed by an arch. This means that, given an arch, we can immediately locate the documents indexed by that arch. This is the basis of the inverted file construction.

The acceleration tables store the components of the arches and the semantic values between components. Indeed, the acceleration tables pre-compute all the similarity values of possible query arches before interrogation. The construction of the inverted file and the acceleration tables is done off-line, as part of the indexing procedure.



5.1 Indexing Algorithm

Doc is a document. GraphIndex is the semantic graph indexing Doc. O is an ontology. FirstArgValue is the acceleration table containing the first argument (concept node) of arches. SecondArgValue is the acceleration table containing the second argument (concept node) of arches. RoleValue is the acceleration table containing the label of arches (relation). InvertedFile is the inverted file grouping, in the same entry, all the documents indexed by an arch.

For each arch ArcDoc of GraphIndex do
    If NotFind(ArcDoc) then
        // TArg1 is the type of the first concept argument of ArcDoc,
        // TArg2 is the type of the second concept argument of ArcDoc,
        // TRelation is the type of the label of ArcDoc
        For all TypeC related to TArg1 by the specialization relation in O do
            WeightArg1 <- SimC(TypeC, TArg1)
            FirstArgValue.AddTuple(ArcDoc, TypeC, WeightArg1)
        EndFor
        For all TypeC related to TArg2 by the specialization relation in O do
            WeightArg2 <- SimC(TypeC, TArg2)
            SecondArgValue.AddTuple(ArcDoc, TypeC, WeightArg2)
        EndFor
        For all TypeR related to TRelation by the specialization relation in O do
            WeightRel <- SimR(TypeR, TRelation)
            RoleValue.AddTuple(ArcDoc, TypeR, WeightRel)
        EndFor
    EndIf
    InvertedFile.AddTuple(ArcDoc, Doc)
EndFor

NotFind(ArcDoc) is a boolean function which returns true if ArcDoc does not yet exist in the database.

An important part of the complexity of the pseudo projection operator is handled at indexing time: for each indexing component, we find all the components that can be related to it (components that specialize, or are specialized by, the indexing component) and then compute the similarity between them. This makes it easier to compute the similarity between a query arch and an indexing arch.
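The indexing procedure can be sketched in Python with in-memory dictionaries in place of database tables. This is an illustrative sketch, not the SyDoM implementation: the class `Index`, the helper `related_types` (which stands in for the ontology lookup combined with SimC and SimR), and the sample vocabulary and weights are all assumptions.

```python
from collections import defaultdict


class Index:
    def __init__(self, related_types):
        # related_types(t) -> {related type: similarity}; hypothetical helper
        # standing in for the ontology lookup plus SimC / SimR.
        self.related_types = related_types
        self.first_arg_value = defaultdict(dict)   # index arch -> {type: weight}
        self.second_arg_value = defaultdict(dict)  # index arch -> {type: weight}
        self.role_value = defaultdict(dict)        # index arch -> {relation: weight}
        self.inverted_file = defaultdict(set)      # index arch -> documents

    def add_document(self, doc, graph_index):
        """graph_index: iterable of arches (relation type, type arg 1, type arg 2)."""
        for arch in graph_index:
            if arch not in self.inverted_file:     # arch not indexed yet
                relation, t_arg1, t_arg2 = arch
                self.first_arg_value[arch] = dict(self.related_types(t_arg1))
                self.second_arg_value[arch] = dict(self.related_types(t_arg2))
                self.role_value[arch] = dict(self.related_types(relation))
            self.inverted_file[arch].add(doc)


# Hypothetical pre-computed similarities between related types.
RELATED = {
    "Car": {"Car": 1.0, "Vehicle": 0.75},
    "Engine": {"Engine": 1.0},
    "partOf": {"partOf": 1.0},
}
index = Index(lambda t: RELATED.get(t, {t: 1.0}))
arch = ("partOf", "Engine", "Car")
index.add_document("doc1", [arch])
index.add_document("doc2", [arch])
print(sorted(index.inverted_file[arch]), index.second_arg_value[arch])
```

The acceleration tables are filled only the first time an arch is seen, while the inverted file is updated for every document, as in the algorithm.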







5.2 Retrieval Algorithm 

By making use of the pre-computations stored in the acceleration tables, the retrieval function performs, in one operation, the evaluation of the pseudo projection between the query graph and the indexing graphs.

GraphReq is a query graph and nbArc is the number of 
arches in GraphReq. Threshold is a fixed threshold to 
filter result documents. ListDocResu.lt is a weighted 
list of documents. 

For each arch ArcReq of GraphReq do

    ListArcIndex <- FindRelatedArc(ArcReq)

    For each (ArcIndex, WeightArc) of ListArcIndex do

        ListDoc <- FindDocList(ArcIndex)

        // For a query arch, the document weight is the maximum of the
        // similarity values between the query arch and one of its
        // indexing arches

        For each Doc of ListDoc do
            If ListDocArc.Belong(Doc) Then
                Weight <- ListDocArc.FindWeight(Doc)
                NewWeight <- max(Weight, WeightArc)
                ListDocArc.ReplaceWeight(Doc, NewWeight)
            Else
                ListDocArc.Add(Doc, WeightArc)
            EndIf
        EndFor

    EndFor

    // For a query graph, the document weight is the sum of all the
    // similarity values (WeightArc) between a query arch and one of its
    // indexing arches, divided by the number of query arches.







    For each (Doc, WeightArc) of ListDocArc do
        If ListDocResult.Belong(Doc) Then
            Weight <- ListDocResult.FindWeight(Doc)
            NewWeight <- Weight + (WeightArc / nbArc)
            ListDocResult.ReplaceWeight(Doc, NewWeight)
        Else
            ListDocResult.Add(Doc, WeightArc / nbArc)
        EndIf
    EndFor

EndFor

For each (Doc, WeightArc) of ListDocResult do
    If WeightArc < Threshold Then
        ListDocResult.Remove(Doc, WeightArc)
    EndIf
EndFor

FindRelatedArc(ArcReq) is a function which returns a
list (ListArcIndex) of arches (ArcIndex), each associated
with a similarity value (WeightArc).
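
The retrieval steps above can be sketched in Python. This is only an illustrative sketch, not the SyDoM implementation (which the paper describes as Java on top of a database): the function name `retrieve` and the dictionaries `related_arcs` and `inverted_file` are hypothetical stand-ins for FindRelatedArc, FindDocList, and the acceleration tables, and the sample data are invented. The per-query-arch list is assumed to be reset for every query arch.

```python
def retrieve(query_arcs, related_arcs, inverted_file, threshold):
    """Weight documents against a query graph given as a list of arches.

    related_arcs:  query arch -> list of (index arch, similarity)  [assumed table]
    inverted_file: index arch -> list of documents                 [assumed table]
    """
    nb_arc = len(query_arcs)
    result = {}
    for arc_req in query_arcs:
        # For one query arch, keep the maximum similarity per document.
        best_per_doc = {}
        for arc_index, weight_arc in related_arcs.get(arc_req, []):
            for doc in inverted_file.get(arc_index, []):
                best_per_doc[doc] = max(best_per_doc.get(doc, 0.0), weight_arc)
        # For the whole graph, sum the per-arch maxima divided by nbArc.
        for doc, weight_arc in best_per_doc.items():
            result[doc] = result.get(doc, 0.0) + weight_arc / nb_arc
    # Drop documents whose final weight is below the threshold.
    return {doc: w for doc, w in result.items() if w >= threshold}


# Invented sample data: two query arches, three indexing arches, two documents.
related_arcs = {"q1": [("a1", 1.0), ("a2", 0.7)], "q2": [("a3", 0.9)]}
inverted_file = {"a1": ["d1"], "a2": ["d1", "d2"], "a3": ["d1"]}

results = retrieve(["q1", "q2"], related_arcs, inverted_file, threshold=0.5)
unfiltered = retrieve(["q1", "q2"], related_arcs, inverted_file, threshold=0.0)
print(sorted(results))
print({doc: round(w, 2) for doc, w in unfiltered.items()})
```

In this sample, document d1 matches both query arches and d2 matches only one with a lower similarity, so only d1 survives a threshold of 0.5.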

In the retrieval algorithm, the partial pseudo projections are obtained in polynomial
time, as most of the algorithm consists of table joins. The cost of a projection
operator between graphs is usually overestimated, because it is often tied to graph
theory and to the search for a morphism between indefinite structures. To overcome
this problem, we have limited the graph structure: we consider that a graph is a set
of arches and that a concept node is unique in a semantic graph. We now have to
evaluate whether, under this limitation, we can still obtain good results during a
retrieval process.



6 The SyDoM Prototype 

Our multilingual retrieval system is called SyDoM (Multilingual Documentary
System) [5]. The system is implemented in Java on top of the Microsoft Access
database system. SyDoM is composed of three modules:

- The ontology module manages the documentary language, including new vocabulary
and new domain entities. The documentary language is used for indexing and querying
a multilingual document collection.







- The indexing module indexes and annotates XML documents with semantic graphs
using a set of metadata associated with the ontology.

- The retrieval module performs multilingual retrieval. 

Currently, the ontology can be displayed in two languages (French and English).
The system also supports a manual indexing process for XML documents using a set
of metadata associated with the ontology. Our experiments were carried out on English
articles dealing with mechanics, namely preprints of the Society of Automotive
Engineers (SAE). The first step of the experiment was to build a domain ontology.
Using a mechanical thesaurus, we stored 105 mechanical concepts in our ontology.
In addition, we added 35 relations found in Sowa's knowledge representation book [7].
During manual indexing, only titles are taken into account. For our first experiments,
we manually indexed approximately 50 articles and used 10 queries. The average
indexing graph consists of 4 arches and the average query graph consists of 2 arches.

Users enter their queries through the query interface presented in Fig. 3.
This figure corresponds to the French query "modèle de combustion" (combustion
model).





Fig. 3. SyDoM interface 



When retrieval is started, the data entered by the user are collected and an internal
representation of the query is produced. The results are ordered from the most
relevant document to the least relevant one.

To evaluate our system, we compare it to the IR system used at the scientific
library of our Institute, called Doc'INSA. In this system, documents and queries are
represented by a list of keywords. The matching function between documents and
queries evaluates the number of common keywords. The indices of this system were
generated automatically from the index graphs of SyDoM, so that index variability
does not have to be taken into account.

The next figure presents the performance of our system using different thresholds
(0.8, 0.6, 0.5, 0.4, and 0). We compute the average precision for ten recall intervals.
The constant values of the type similarity function are arbitrarily fixed ($V_G = 0.7$
and $V_S = 0.9$).




Fig. 4. SyDoM evaluation using different thresholds (legend: 0.8, 0.6, 0.5, 0.4, 0)

The trend of the curve can be explained by the fact that our collection is small.
Therefore, most of the queries concern only a few documents. Because these
documents are retrieved with a high weight (more than 0.8), the precision is
good whatever the recall.

The next figure presents the comparison of SyDoM with the Doc'INSA system. We
can see that the treatment of relations and hierarchy inference significantly improve
the quality of the answers, even for manual indexing.




Fig. 5. Evaluation of SyDoM (threshold = 0.6) and the Doc'INSA system







The next step would be to compare our system with RELIEF and with David Genest's
system. We aim to test whether our model, compared with the extension of CG
proposed by David Genest, can achieve similar results in less computational time.
The main challenge is that both his system and ours rely on a manual indexing
process, so it is not easy to find the human resources needed to carry out a real
experiment.



7 Conclusion 

In this paper, we proposed a prototype of a Multilingual Information System, focusing
on the indexing and information retrieval modules. We have defined a new model of
conceptual graph in order to enhance the effectiveness of our retrieval system. To this
end, we have integrated a new extension of CG proposed by Genest and a fast
retrieval technique. We have already observed that our proposal is operational and gives
better results than a traditional documentary system. This is very encouraging for a first
implementation. At this point, we need to carry out further experiments.



References 

1. D. Genest. « Extension du modèle des graphes conceptuels pour la recherche
d'information ». PhD Thesis, Montpellier University, Montpellier, France, 2000.

2. J.Y. Nie. « Un modèle logique général pour les systèmes de recherche d'informations.
Application au prototype RIME ». PhD Thesis, Joseph Fourier University, Grenoble,
France, 1990.

3. I. Ounis, M. Pasca. « RELIEF: Combining expressiveness and rapidity into a single
system ». Proceedings of the 18th SIGIR Conference, Melbourne, Australia, pp. 266-274,
August 1998.

4. C.J. van Rijsbergen. « A new Theoretical Framework for Information Retrieval ».
Proceedings of the 9th SIGIR Conference, Pisa, pp. 194-200, September 1986.

5. C. Roussey, S. Calabretto, J.-M. Pinon. « Un modèle d'indexation pour une collection
multilingue de documents ». Proceedings of the 3rd CIDE Conference, Lyon, France,
pp. 153-169, July 2000.

6. J. Sowa. « Conceptual Structures: Information Processing in Mind and Machine ». The
System Programming Series, Addison Wesley Publishing Company, 1984.

7. J. Sowa. « Knowledge Representation: Logical, Philosophical, and Computational
Foundations ». Brooks Cole Publishing Co., Pacific Grove, CA, 2000.




Capturing Fuzziness and Uncertainty 
of Spatiotemporal Objects 



Dieter Pfoser and Nectaria Tryfona 

Computer Science Department, Aalborg University 
Fredrik Bajersvej 7E, DK-9220 Aalborg East, Denmark 
{pfoser, tryfona}@cs.auc.dk



Abstract. For the majority of spatiotemporal applications, we assume that the
modeled world is precise and bounded. This simplification seems unnecessarily
crude for many environments handling spatial and temporal extents, such as
navigational applications. In this work, we explore fuzziness and uncertainty, 
which we subsume under the term indeterminacy, in the spatiotemporal con- 
text. We first show how the fundamental modeling concepts of spatial objects, 
attributes, relationships, time points, time periods, and events are influenced by 
indeterminacy, and then show how these concepts can be combined. Next, we 
focus on the change of spatial objects according to their geometry over time. 
We outline four scenarios, which identify discrete and continuous change, and 
we present how to model indeterminate change. We demonstrate the applicabil- 
ity of this proposal by describing the uncertainty related to the movement of 
point objects, such as the recording of the whereabouts of taxis. 



1 Introduction 

Spatiotemporal applications have received a lot of attention over the past years.
Requirements analysis [15], models [4], data types [8], and data structures [14] are some of
the main topics in this area. Although considerable research effort and valuable re- 
sults do exist, all the studies and approaches are based on the assumption that, in the 
spatiotemporal mini-world, objects have crisp boundaries, relationships among them 
are precisely defined, and accurate measurements of positions lead to error-free rep- 
resentations. 

However, reality is different. Very often boundaries do not strictly separate objects 
but, rather, show a transition between them. Consider the example from an environ- 
mental system in which the different soil zones, such as desert and prairie, are not 
precisely bounded. We encounter a transition, or fuzziness, between them. On the other
hand, in navigational systems, the position of a moving vehicle, although precise in 
its nature, might not be exactly known, e.g., car A is in New York. We encounter 
uncertainty, i.e., lack of knowledge or error about its actual location. 

In this paper, we deal with fuzziness and uncertainty as related to spatiotemporal
objects. More specifically, we start by pointing out the semantic differences between
the two cases that constitute spatiotemporal indeterminacy: fuzziness, concerning
“blurry” situations, and uncertainty, expressing the “not-exactly-known” reality. We



A. Caplinskas and J. Eder (Eds.): ADBIS 2001, LNCS 2151, pp. 112-126, 2001. 
© Springer- Verlag Berlin Heidelberg 2001 







clarify these terms in the spatial and temporal domains, as well as the combined ef- 
fect, i.e., spatiotemporal fuzziness and uncertainty. We show how the basic spatio- 
temporal modeling concepts of spatial objects, attributes, relationships, time points, 
time periods, events, and change are influenced by indeterminacy. We provide formal 
ways to describe this, while an example demonstrates the applicability of this pro- 
posal. A more elaborate discussion with the use of fuzzy set and probability theory in 
this area can be found in [16]. 

Only a few works address spatiotemporal indeterminacy. [18] focuses on
simple, abstract, spatial and temporal uncertainty concepts and integrates them to 
describe spatial updates in a GIS database. [13] discusses spatiotemporal indetermi- 
nacy for moving objects data. It is, however, limited to point objects and it does not 
take temporal errors into account. [2] aims at describing the change of fuzzy features 
over time using a raster representation. More work exists on temporal indeterminacy,
e.g., [5], and on spatial indeterminacy, e.g., [1], [3], [7], [17], [19], [20].

The rest of the paper is organized as follows. Section 2 briefly presents the funda- 
mental spatial and temporal concepts involved in the spatiotemporal application do- 
main. Section 3 explores the semantics and gives the mathematical expression of 
indeterminate temporal concepts. Section 4 deals with indeterminate spatial concepts. 
Section 5 discusses change as the spatiotemporal concept affected by indeterminacy. 
Finally, Sect. 6 concludes with the future research plans. 



2 Spatial and Temporal Concepts 

To understand spatiotemporal indeterminacy, it is important to grasp the fundamental
spatial, temporal, and spatiotemporal concepts.

Spatiotemporal applications can be categorized based on the type of data they
manage: (a) applications dealing with moving objects, such as navigational
applications, e.g., a moving “car” on a road network; (b) applications involving
objects located in space whose characteristics and position may change over time,
e.g., in a cadastral information system, “landparcels” change position by changing
shape, but they do not “move”; and (c) applications integrating the above two
behaviors, e.g., in environmental applications, “pollution” is measured as a moving
phenomenon that changes properties and shape over time. The following modeling
concepts are involved in such environments.

• Spatial Objects and their geometry. Spatial objects are objects whose position in 
space matters, e.g., a moving “car.” Many times, not only the actual object’s position 
matters, but its geometry does as well. For example, in a cadastral system the exact 
geometry of a “landparcel” is of importance. The geometry of the position of a spatial 
object can be (of type) point, line, region or any combination thereof [10]. 

• Spatial Relationships. Spatial relationships relate spatial objects, or more precisely, 
the positions of the objects, e.g., two landparcels share common borders. 

• Spatial Attributes and their geometry. Spatial objects have, apart from descriptive 
attributes, also spatial attributes, e.g., the “vegetation” of a “landparcel.” Values of 
spatial attributes depend on the referenced position and not on the object itself. 







Spatial attributes are also related to geometries in space, as they split space into parts
within whose extents the values of the spatial attributes remain the same; each part of
space has (like the objects’ positions) a geometry of type point, line, region, or any
combination thereof. For example, the attribute “vegetation” creates partitions of
space with constant vegetation values in each partition, such as “forest” and
“bushes.” There are two types of spatial attributes: (a) those representing functions
with continuous range, e.g., “temperature,” for which the geometry of the partitions is
point, and (b) those representing functions with discrete range, e.g., “vegetation,”
which is represented as a set of regions. Not all spatial objects have spatial attributes.
For example, no spatial attribute is usually assigned to a moving “car,” while various
ones, e.g., “soil type,” may be assigned to a “landparcel.”

• Time. Many different models of time exist. Some authors even propose taxonomies
of time. In our work we assume a linearly ordered time line, isomorphic to a finite
subset of the natural numbers. The elements of this set are termed chronons.

• Time points vs. Time periods. Two basic models of time are used to record facts
and information in a database: time point and time period. A time point $t_i$ is located
during a chronon, while a time period $[t_k, t_m]$, with $t_k$, $t_m$ time points and $k < m$,
has a duration and is defined as a set of chronons.

• Events and States. An event occurs at an exact time point, i.e., an event has no
duration, for example a “car crash.” A state is defined for each chronon in a time
period. It has duration, e.g., a “meeting” takes place from 9 am until 11 am.



3 Temporal Indeterminacy 

In temporal applications we are interested in events and their occurrence time. How- 
ever, sometimes we only know approximately when an event occurred, e.g., a traffic 
accident happened between “2 pm and 4 pm.” Next, we present models to represent 
indeterminacy in the temporal domain by adapting the model presented in [5].

3.1 Indeterminate Time Points 

A time point is determinate if it is known during which chronon it is located. Fig. 1a
shows the determinate time point $t_1$, based on the approach that a chronon is longer
than a time point. A time point is indeterminate if it is not known exactly when, but
approximately during which series of chronons, it is located. An indeterminate time
point is described by a lower support, an upper support, and a probability function
[5]. The supports are chronons that delimit the location of the time point, e.g., for
time point $t_2$ in Fig. 1a, the lower support is 5 and the upper support is 8; the
probability function gives the likelihood of the time point being located at each
chronon within the range, e.g., under a uniform distribution, the time point is equally
likely to be located at any of the chronons 5 to 8.

In the following, we use probability and fuzzy set theory to quantify indeterminacy.
The probability mass function, $p_x$, for the indeterminate point $x$ is

$$p_x(i) = P[x = i], \quad i \in \mathbb{N} \qquad (1)$$












Fig. 1. (a) Determinate ($t_1$) and indeterminate ($t_2$) time points, (b) indeterminate time period,
probabilities of bounding time points (solid line: probability density function, dashed line:
probability mass function). Horizontal axes: chronons 1 to 8.



where $P[x = i]$ is the probability that the time point is located during chronon $i$. In
our example, assuming a uniform distribution, $P[t_2 = 6] = 0.25$; the probability
outside the range from lower support to upper support is 0. Also, all indeterminate
time points are considered to be independent, i.e.,

$$P[x = i \wedge y = j] = P[x = i] \times P[y = j] \qquad (2)$$

We can state that all probability distributions are fuzzy sets [16]. Using the
probability mass function as a basis, we obtain the following membership function

$$\mu_x(i) = \lambda \, p_x(i) \qquad (3)$$

where $\lambda$ is an arbitrary scale factor relating the membership grade to the probability.
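
As a minimal sketch of Eqs. (1) to (3), assuming a uniform distribution between the lower and the upper support, the following Python snippet builds the probability mass function of an indeterminate time point and scales it into a membership grade. The function names, the second time point with supports 1 and 3, and the scale factor of 4 are assumptions chosen for illustration; they are not prescribed by the model in [5].

```python
from fractions import Fraction


def uniform_pmf(lower, upper):
    """PMF of an indeterminate time point, uniform over chronons lower..upper."""
    n = upper - lower + 1
    return {i: Fraction(1, n) for i in range(lower, upper + 1)}


def prob(pmf, i):
    """Eq. (1): P[x = i]; the probability outside the supports is 0."""
    return pmf.get(i, Fraction(0))


def joint_probability(pmf_x, i, pmf_y, j):
    """Eq. (2): indeterminate time points are treated as independent."""
    return prob(pmf_x, i) * prob(pmf_y, j)


def membership(pmf, i, scale):
    """Eq. (3): membership grade as a scaled probability."""
    return scale * prob(pmf, i)


t2 = uniform_pmf(5, 8)   # supports taken from the running example
t1 = uniform_pmf(1, 3)   # hypothetical second indeterminate point

print(prob(t2, 6))                      # 1/4
print(prob(t2, 9))                      # 0
print(joint_probability(t1, 2, t2, 6))  # 1/12
print(membership(t2, 6, scale=4))       # 1
```

Exact fractions are used instead of floats so that the uniform probabilities can be compared without rounding concerns.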

3.2 Indeterminate Time Periods 

A time period is a subset of the time line bounded by two time points. Depending on
whether the bounding points are determinate or indeterminate, we term the time
period accordingly. In Fig. 1b, $t_1$ and $t_2$ denote the indeterminate start and end points
of the period. Possible periods can range from chronon 1 to chronon 8 (max), but
have to range at least from 3 to 6 (min).

The time period presented in Fig. 1b can also be perceived as having a fuzzy
boundary. Next, we derive a membership function, $\mu_T(x)$, returning the degree to
which an arbitrary chronon $x$ is part of the time period $T$. From Fig. 1b, we can
deduce that chronons 4 and 5 are definitely part of the time period $T$, whereas other
chronons might be. Assuming a uniform distribution of the chronons within the time
points $t_1$ and $t_2$, we can see that if chronon 2 is within the period, so has to be
chronon 3. Further, if chronon 1 is within, so have to be chronons 2 and 3. The same
is true for chronons 6, 7, and 8 of $t_2$. Thus, in three cases chronon 3, in two cases
chronon 2, and in one case chronon 1 is within period $T$. The probability mass
functions of $t_1$ and $t_2$, as shown in Fig. 1b, give the probability for a chronon to be
in $T$. Summing up the probabilities from “the outside to the inside,” we obtain a step
function, the probability density function.

To derive the membership function, $\mu_T(x)$, we have to split the time period $T$ into
three parts: (1) the “core” (chronons 4 and 5), (2) the intervals $t_1$ and $t_2$, and (3) the
outside world. A membership grade of 1 and 0 indicate definite and no membership








in the time period, respectively. All chronons in the core have a grade of 1. The grade
of the chronons in the intervals is equal to the value of the probability density
function. Formula 4 summarizes the membership function.

$$\mu_T(x) = \begin{cases} 1 & x \in \text{core} \\ p(x) & x \in t_1 \cup t_2 \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$
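
The "outside to the inside" summation behind Formula 4 can be sketched as follows. The sketch assumes uniform, independent start and end points whose supports do not overlap (chronons 1 to 3 and 6 to 8, as in the running example); the function names are hypothetical. The grade of a chronon is computed as the probability that the period has already started multiplied by the probability that it has not yet ended, which reduces to the step function described above.

```python
from fractions import Fraction


def uniform_pmf(lower, upper):
    """Uniform PMF of an indeterminate time point over chronons lower..upper."""
    n = upper - lower + 1
    return {i: Fraction(1, n) for i in range(lower, upper + 1)}


def period_membership(x, start_pmf, end_pmf):
    """Grade to which chronon x belongs to the period [start, end].

    Sums the start probabilities up to x (period already started) and the
    end probabilities from x onwards (period not yet ended); independence
    of the two bounding points is assumed.
    """
    p_started = sum(p for chronon, p in start_pmf.items() if chronon <= x)
    p_not_ended = sum(p for chronon, p in end_pmf.items() if chronon >= x)
    return p_started * p_not_ended


start = uniform_pmf(1, 3)  # indeterminate start point
end = uniform_pmf(6, 8)    # indeterminate end point

for chronon in range(0, 10):
    print(chronon, period_membership(chronon, start, end))
```

Chronons 4 and 5 receive grade 1, the grades fall off in steps of one third towards chronons 1 and 8, and chronons outside the supports receive grade 0.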



4 Spatial Indeterminacy 

In the spatial indeterminacy area, [9] states that fuzziness is a property of a
geographic entity. Fuzziness concerns objects that cannot be precisely defined otherwise
[6]. On the other hand, uncertainty results from limitations of the observation, i.e., the
measurement process [9].

4.1 Indeterminate Spatial Objects, Relationships, and Attributes 

In the following, we point out the differences between spatial fuzziness and spatial 
uncertainty more prominently. Consider the example of the different soil zones, e.g., 
desert and prairie. Each zone is not precisely bound, but, rather, a blurry situation 
exists around their common boundaries. We can identify a location for which we are 
sure it is within the desert or the prairie, and we can find a location that is in-between. 
Consequently, the boundary between the two soil zones is fuzzy. However, for a forest
divided into separate landparcels, we can clearly say what tree belongs to what land- 
parcel. The boundaries between the land parcels are crisp and thus certain. 

In contrast, let us consider the position of a moving vehicle whose location is not 
exactly known, e.g., a car is in New York. This example is characterized by a lack of 
knowledge about the car’s location. The fact that the car is somewhere is precise. 
However, the lack of knowledge we have about its position introduces uncertainty. 
Without further knowledge, we can only give the probable area the car is in. 

These examples indicate that the distinguishing element between fuzzy and non-fuzzy
facts is the crisp boundary: a fact is fuzzy when we cannot clearly say what belongs to
what. The concept of boundary introduces the interior/exterior notion, i.e., what is
within the boundary and what is outside. Spatial fuzziness occurs (a) in the
relationships among spatial objects and (b) in spatial attributes.

On the other hand, the distinguishing element between uncertain and certain facts 
is the lack of, or the error in our knowledge, i.e., not sufficient knowledge about an 
otherwise precise fact. As a result, spatial uncertainty can refer to the degree of 
knowledge we have about an object’s position. Uncertainty about an object's position 
leads to uncertainty about the spatial relationship among this object and its neighbors, 
e.g., if the exact boundary of a land parcel is not known, then, the exact relationships 
with its neighboring land parcels are not known either. Furthermore, uncertainty can 
exist for spatial attributes, when knowledge about them is limited. 







4.2 Indeterminate Geometry 

In this section, we examine in what ways fuzziness and uncertainty affect the concept 
of geometry. This is essential in defining spatial objects and spatial attributes; spatial 
relationships are defined in terms of the geometry of spatial objects. 

Points and regions are the most commonly met geometries in spatial applications,
while a line is a special case of a region. Here, we only consider simple geometries,
i.e., points and regions with no holes and no disconnected parts. The following cases
exist: (a) Uncertain point. A point can be crisp and uncertain, e.g., we know the
approximate position of a car and can give probabilities for its location. (b) Fuzzy
point. This case is not applicable, since the concepts of boundary and interior/exterior
do not exist here. (c) Uncertain region. Consider the example of a landparcel with
“not-exactly-known” (missing data) boundaries. (d) Fuzzy region. Since a region is
determined by its boundaries (something is inside/outside, or left/right), a region can
be fuzzy, e.g., consider soil zones, whose boundaries are not crisp, but transitional.

Indeterminate Points. We model space as a set of points, homeomorphic to $\mathbb{N}^2$. The
exact position of an object with geometry point is determinate if it can be mapped
onto a single point $p \in \mathbb{N}^2$. The position is indeterminate if it can only be mapped to
a set of points, i.e., the exact position is unknown. A probability function describes
the likelihood for each point to be the position, e.g., a uniform distribution tells us that
there is an equal chance for each point. The probability mass function, $p_x$, for the
indeterminate point $x$ is

$$p_x(i) = P[x = i], \quad i \in \mathbb{N} \times \mathbb{N} \qquad (5)$$

where $P[x = i]$ is the probability that the position is mapped to point $i$, with $i$ being a
Cartesian coordinate. The probability that the position is outside the point set is 0.
Further, all indeterminate positions are considered to be independent.

What applies to time points, can also be applied to indeterminate points in the spa- 
tial context; probability distributions describing positional indeterminacy can always 
be interpreted as fuzziness. 

Indeterminate Regions. Indeterminate regions comprise uncertain and fuzzy regions.
A region is a part of space bounded by a connected set of points, the boundary. It is
determinate if the boundary points are determinate. Consequently, indeterminate
points bound an indeterminate area. The following example illustrates this point.

Uncertain Regions. Consider a map made up of two discrete regions, A and B, shar- 
ing a common boundary. If we repeatedly digitize the map, assuming that our process 
introduces errors, we obtain a set of points that lies close to the actual boundary line. 
However, there will be more points closer to the actual location of the line than fur- 
ther away from it. Due to lack of knowledge, this distribution might take the form of 
a normal distribution whose mean is centered at the “true” location of the line. In 
Fig. 2a, we show the normal distribution of a particular boundary point. Fig. 2b 
shows the probability function in the continuous case. 

Analogously, we can describe this uncertain region using a membership function. 
To determine this function that returns the grade to which an arbitrary point in space 








Fig. 2. Boundary point probability 



belongs to an area, we split the underlying space into three parts: (a) the core of the
area, (b) the boundary region, and (c) the outside. Consequently, a membership
function for area $A$ can be specified as follows.

$$\mu_A(x) = \begin{cases} 1 & x \in A \wedge x \notin B \\ p_A(x) & x \in A \wedge B \\ 0 & \text{otherwise} \end{cases} \qquad (6)$$

Area $B$ stands for the outside of area $A$, and $p_A(x)$ is the probability mass function of
a point for being in area $A$. The argument of the membership function is a point and it
returns a grade for the membership of this point in area $A$. The grade is 1 if the point
is a definite member of the area and 0 if it is definitely not a member of the area.
Otherwise the grade is between 1 and 0 (cf. Fig. 2a).
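
One way to make the uncertain-region membership concrete is a one-dimensional cross-section through the boundary. The sketch below assumes that the digitized boundary position is normally distributed around its "true" location, that area A lies on the side of smaller coordinates, and that a cut-off of three standard deviations separates the core and the outside from the boundary region. The names `boundary_mean`, `sigma`, and `cutoff`, as well as the cut-off itself, are assumptions made for illustration and do not come from the paper.

```python
import math


def normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def uncertain_membership(x, boundary_mean, sigma, cutoff=3.0):
    """Grade to which coordinate x belongs to area A (A lies at smaller x)."""
    z = (x - boundary_mean) / sigma
    if z <= -cutoff:
        return 1.0  # core of A: definitely a member
    if z >= cutoff:
        return 0.0  # outside (area B): definitely not a member
    # Boundary region: probability that the true boundary lies beyond x,
    # i.e. that x is still inside A.
    return 1.0 - normal_cdf(z)


# Hypothetical boundary at coordinate 10 with a digitizing error of sigma = 2.
for x in (0, 8, 10, 12, 20):
    print(x, round(uncertain_membership(x, boundary_mean=10, sigma=2), 3))
```

The grade is exactly 0.5 at the mean boundary location and decreases monotonically from the core of A towards the outside.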



Fuzzy Regions. The above approach is only feasible when the probability function is
known and simple, i.e., there is one probability function describing the distribution of
all points in the boundary. If there were many probability functions, the membership
function would become too complex to be useful. On the other hand, in some cases,
we do not have “any information at all” about the boundary of a region. Consider
again the transition between soil zones. The boundary exists because of the very
nature of a phenomenon that is not crisp and, thus, giving a probability function
describing it is not possible, or does not make sense. This illustrates the critical case
for which fuzziness relieves uncertainty. We can still derive a valid membership function
by assuming a smooth and steady transition from one zone to the other. A membership
function for soil zones, as shown in Fig. 2b, could be characterized by the following
formula (cf. [17]),



$$\mu_A(x, y) = \begin{cases} 1 & \text{if } (x, y) \in A \\ 1 - d_a/(d_a + d_b) & \text{if } (x, y) \notin A \wedge (x, y) \notin B \\ 0 & \text{otherwise} \end{cases} \qquad (7)$$

where $d_a$ and $d_b$ are the distances from a point $(x, y)$ to the core areas of the soil
zones $A$ and $B$. A formula for the distance $d$ from an arbitrary point given by its
coordinates $(x, y)$ to an area $A$ with the boundary $B_A$ is as follows

$$d((x, y), B_A) = \min\{\mathrm{dist}((x, y), (m, n)) \mid (m, n) \in B_A\} \qquad (8)$$

where $\mathrm{dist}(p, q)$ is the Euclidean distance between two points $p, q \in \mathbb{R}^2$.






Above, the assumption is that the transition between the soil zones is linear. Other
kinds of transition would change the formula describing the membership grade for
positions outside the core.
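
The linear-transition membership function and the distance to a boundary can be sketched in Python as follows. The sketch assumes that the core areas and their boundaries are given as finite sets of grid points; the function names and the one-dimensional "desert/prairie" strip are hypothetical and serve only to illustrate the computation.

```python
import math


def distance_to_boundary(point, boundary):
    """Smallest Euclidean distance from `point` to any point of `boundary`."""
    px, py = point
    return min(math.hypot(px - m, py - n) for (m, n) in boundary)


def fuzzy_membership(point, core_a, core_b, boundary_a, boundary_b):
    """Grade of `point` in zone A, assuming a linear transition between zones."""
    if point in core_a:
        return 1.0
    if point in core_b:
        return 0.0
    d_a = distance_to_boundary(point, boundary_a)
    d_b = distance_to_boundary(point, boundary_b)
    return 1.0 - d_a / (d_a + d_b)


# Hypothetical strip: desert core at x = 0..2, prairie core at x = 6..8.
desert = {(x, 0) for x in range(0, 3)}
prairie = {(x, 0) for x in range(6, 9)}
desert_edge = {(2, 0)}
prairie_edge = {(6, 0)}

for x in range(0, 9):
    grade = fuzzy_membership((x, 0), desert, prairie, desert_edge, prairie_edge)
    print(x, grade)
```

For the points between the two cores, the grade drops linearly from 0.75 at x = 3 to 0.25 at x = 5, with 0.5 at the midpoint.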



5 Spatiotemporal Indeterminacy 

After showing the nature of spatial and temporal indeterminacy as well as the way to
model it, we describe the combined phenomenon, spatiotemporal indeterminacy.
Consider the example of a moving vehicle. It is reasonable to assume that its extent
does not matter in a given application and, thus, that it can be reduced to a point. To
record its movement, we sample the object’s position. We cannot answer queries about
an object’s movement at times in-between position samples unless we interpolate the
positions, e.g., linear interpolation.
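
Linear interpolation between position samples can be sketched as below. The sketch assumes at least two samples given as (time, x, y) tuples with strictly increasing times; the function name and the sample values are hypothetical.

```python
def interpolate_position(samples, t):
    """Linearly interpolate the position of a point object at time t.

    samples: list of (time, x, y) tuples sorted by strictly increasing time.
    """
    if not samples[0][0] <= t <= samples[-1][0]:
        raise ValueError("t lies outside the sampled time range")
    for (t0, x0, y0), (t1, x1, y1) in zip(samples, samples[1:]):
        if t0 <= t <= t1:
            fraction = (t - t0) / (t1 - t0)
            return (x0 + fraction * (x1 - x0), y0 + fraction * (y1 - y0))


# Hypothetical taxi positions sampled every 10 time units.
samples = [(0, 0.0, 0.0), (10, 10.0, 20.0), (20, 10.0, 40.0)]

print(interpolate_position(samples, 5))   # (5.0, 10.0)
print(interpolate_position(samples, 15))  # (10.0, 30.0)
```

The interpolated position is only an estimate: the actual position between two samples remains uncertain, which is exactly the indeterminacy discussed in this section.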

For areal objects, the change of position includes the change of their centroid and 
shape, which has to be interpolated as well. Consider the indeterminate region exam- 
ple of an island. Tides have (a) a short-term effect on its coastline, whereas (b) over a 
longer period of time a general drift can be observed as well. If one is only interested 
in the general drift, the tidal effect can be modeled as a fuzzy boundary that changes 
over time. 

5.1 Spatiotemporal Scenarios and Indeterminate Change 

Change, or evolution, is the most important concept in the spatiotemporal context, 
and will in the following serve as the basis to evaluate spatiotemporal indeterminacy. 
As stated in literature [4], [8], [15], change (a) can either occur on a discrete or on a 
continuous basis and (b) can be recorded in time points or in time periods. 

Table 1 illustrates the four change scenarios encountered in the spatiotemporal 
context by using a 3-dimensional representation of the temporal change of geometry. 
Space (x- and y-coordinates in the horizontal plane) and time (time-coordinate in the 
vertical direction) are combined to form a three dimensional coordinate system. In the 
change scenarios, the elements that can be indeterminate (with respect to an object) 
are geometry, time point, and time interval. We use a point geometry to keep the illus- 
trations simple. However, the same change scenarios apply to other geometries. A 
discrete change of geometry from $G_i$ to $G_{i+1}$ is indicated by using an arrow in the
spatial plane as opposed to a line in case of a continuous change. In the following, we 
examine each scenario with respect to indeterminacy. 

The first case, Scenario 1 in Table 1, is the discrete change of a geometry recorded 
in time points. Geometry stays constant for some time and then changes instantly. It is 
sampled at constant time intervals dt. The geometry and/or the time point can be 
indeterminate. 

The second case, Scenario 2 in Table 1, is the continuous change of a geometry
recorded in time points. We sample a constantly changing geometry at time intervals dt.
Knowing a geometry only at time points has two implications: (i) recording
geometries at time points means assessing a momentary situation without inferring
anything about the geometry before or after the time point; consequently, (ii) time and space are







Table 1. Four spatiotemporal change scenarios

Time   | Discrete change | Continuous change
Point  | 1) Geometry is recorded at a time point. It may or may not differ from the previously recorded one. We do not know when the change occurred. | 2) Geometry is sampled at time points. In between time points we have no knowledge about the geometry.
Period | 3) Geometry is valid for a given time period. After a change, a new time period starts. | 4) Geometry is sampled at time points, the starting and end points of the time period. A time period is assigned a “change” function that models the positional change within the period.


independent; not knowing the exact extent of the geometry does not affect the time 
interval and vice versa. 

In contrast, Scenarios 3 and 4 in Table 1 suggest that a change function of the 
form C: t_x → G_x exists that determines a geometry G_x for a time point t_x in an 
interval spatially bounded by the two geometries G_i and G_{i+1} and temporally bounded by 
the time interval T = [t_i, t_{i+1}]. The change function C can be different for every time 
interval. 

The third case, Scenario 3 in Table 1, is the discrete change of a geometry recorded 
in time intervals. The objective is to “begin” a new interval when a spatial 
change occurs, i.e., new time intervals start at the time points t_0 through t_4. The geometry 
is constant within a time interval. Spatial and temporal indeterminacy affect each 
other. Dealing with indeterminate spatial extents, e.g., uncertainty induced by meas- 
urement errors, implies that the time point at which a change occurs cannot be de- 
tected precisely. On the other hand, having an indeterminate temporal event, e.g., 
clock errors, introduces spatial indeterminacy. 

The last and most complex case, Scenario 4 in Table 1, is the continuous change 
of a geometry recorded in time intervals. This case is based on the fact that for a 





Capturing Fuzziness and Uncertainty of Spatiotemporal Objects 121 



Table 2. Change scenarios without temporal indeterminacy 

Geometry (G_i, G_{i+1})   Time (t_i, t_{i+1})   Change 
Determinate               Determinate           C: t_x → G_x, where G_x, depending on the change function, is determinate or indeterminate 
Indeterminate             Determinate           (a) C: t_x → G_x, where G_x represents a probability function, P(i), or a membership function, μ(i) 
                                                (b) μ_x(i, t) or P_x(i, t) 



given time interval T = [t_i, t_{i+1}], there exists a change function that models the transformation 
from geometry G_i to G_{i+1}. Each of these factors, i.e., (i) the time interval, 
(ii) the geometry, and (iii) the change function, can be subject to indeterminacy. 

In the simplest case, the geometries G_i and G_{i+1} and the time interval T are determinate, 
and the change function returns a determinate geometry G_x for a given time 
point t_x ∈ T. Here, we assume that the change function returns the geometry coinciding 
with the actual movement. If this is not the case, the change function interpolates 
between the geometries G_i and G_{i+1} and returns an indeterminate geometry. An example 
is linear interpolation, i.e., the two geometries G_i and G_{i+1} are considered to 
be the endpoints of a line. Section 5.2 gives an elaborate example of a change function 
for this case. 
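
As a small illustration of such an interpolating change function, the following sketch maps a time point t_x in [t_i, t_{i+1}] to a point geometry on the line between G_i and G_{i+1}. The function name and the tuple representation of a point are assumptions made for this example only; they are not part of the model above.

```python
def linear_change(g_i, g_next, t_i, t_next, t_x):
    """Hypothetical change function C: t_x -> G_x for point geometries.

    g_i and g_next are (x, y) tuples sampled at times t_i and t_next.
    The result is only an estimate of the true position at t_x.
    """
    if t_next <= t_i:
        raise ValueError("the time interval must have positive length")
    if not (t_i <= t_x <= t_next):
        raise ValueError("t_x must lie inside [t_i, t_next]")
    # Fraction of the interval that has elapsed at t_x.
    w = (t_x - t_i) / (t_next - t_i)
    return tuple(a + w * (b - a) for a, b in zip(g_i, g_next))
```

For instance, a quarter of the way through the interval the returned point lies a quarter of the way along the line segment.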

If we further allow G_i and G_{i+1} to be indeterminate, our change function would in 
any case return an indeterminate G_x. In the following, we use a symbol on top 
of the parameter to denote indeterminacy. This means that if a geometry is described 
by a probability or membership function, this very function is subject to change in the 
time interval T_i. 

Following the idea from before, we would have a change function that returns a 
probability or membership function for a given t x (cf. Table 2(a)). However, by inte- 
grating the temporal component, we obtain a spatiotemporal probability or member- 
ship function, i.e., a function that changes with time (cf. Table 2(b)). 

Until now, we always considered time to be determinate. We use time points to determine 
the start and the end of the current time interval T_i, and to denote the time 
point in question, t_x. In case t_i and t_{i+1} are indeterminate, we cannot state the beginning 
and the end of the time interval precisely. Thus, the association of a geometry 
(indeterminate or not) to a time point becomes indeterminate. However, this affects 
mainly the change function and can be considered in adapting its form. In considering 
an indeterminate time interval, we cannot, for any time point in the time interval, give 



Table 3. Change scenarios incorporating temporal indeterminacy 

Geometry (G_i, G_{i+1})   Time (t_i, t_{i+1})   Change 
Determinate               Indeterminate         C: t_x → G_x 
Indeterminate             Indeterminate         (c) C: t_x → G_x, where G_x is either a probability function, P(i), or a membership function, μ(i) 
                                                (d) μ_x(i, t) or P_x(i, t) 






Fig. 3. Movements and space. Panel (a): axes x and time; panel (b): labels “initial state” and “last recorded state” 



a geometry as it would be unaffected by determinate time, but the indeterminate time 
contributes some additional indeterminacy. Table 3 adapts the approach shown in 
Table 2 to cover this case. 

The central element of spatiotemporal indeterminacy is the change function manipulating 
geometries. This function can be seen as similar to a morphing algorithm 
between different instances of geometries, i.e., points, lines, or regions. Next, we give an 
example illustrating the aforementioned concepts. 



5.2 An Example of Use - Tracking Vehicles 

Consider the application scenario in which we track the continuous movement of 
taxis equipped with GPS devices that transmit their positions to a central computer 
using either radio communication links or cellular phones. 

Acquiring Movement - Sampling Moving Objects. To record the movement of an 
object, we would have to know the position on a continuous basis. In practice, 
however, we can only sample an object’s position, i.e., obtain the position at 
discrete instants of time, such as every few seconds. 

The solid line in Fig. 3a represents the movement of a point object. Space (x- and 
y-axes) and time (t-axis) are combined to form one coordinate system. The dashed 
line shows the projection of the movement onto two-dimensional space (x and y co- 
ordinates). A first approach to represent the movements of objects would be to store 
the position samples and interpolate the in-between positions. The simplest approach 
is to use linear interpolation. The sampled positions become the end points of line 
segments of polylines. The movement of an object is represented by an entire polyline 
in three-dimensional space. In geometrical terms, the movement of an object is 
termed a trajectory (we will use “movement” and “trajectory” interchangeably). 
Fig. 3b shows a spatiotemporal space (the cube in solid lines) and several trajectories 
(the solid lines). The top of the cube represents the time of the most recent position 
sample. The wavy-dotted lines symbolize the growth of the cube with time. 







Measurement Error. An error can be introduced by inaccurate measurements. Using 
GPS measurements in sampling, the error can be described by a probability function, 
in our case, a bivariate normal distribution P_1, 

$$P_1(x, y) = \frac{1}{2\pi\sigma^2}\, e^{-\frac{x^2 + y^2}{2\sigma^2}} \qquad (9)$$ 

where σ is the standard deviation. For details on this error measure refer to [13]. 
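
To make Equation 9 concrete, the density can be evaluated as in the following sketch. The function name is hypothetical, and the coordinates x and y are assumed to be offsets from the measured position, in the same unit as σ.

```python
import math


def gps_error_density(x, y, sigma):
    """Evaluate the circular bivariate normal density of Equation 9.

    x, y  : offsets from the measured position (assumed convention)
    sigma : standard deviation, must be positive
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    exponent = -(x * x + y * y) / (2.0 * sigma ** 2)
    return math.exp(exponent) / (2.0 * math.pi * sigma ** 2)
```

The density is highest at the measured position itself and falls off symmetrically in all directions.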

Which Scenario? In Table 1 of Sect. 5.1, the sampling approach to assess the movement 
of objects is characterized by Scenario 4. Tables 2 and 3 establish a foundation 
for giving a change function in between sampled positions. Table 3 gives function 
templates in case the times of sampling are not known precisely. However, GPS allows 
for precise timing and, thus, we neglect the effects of time. In Table 2, Scenario 
1 (determinate geometry) gives a function template in case the sampled positions are 
known precisely. GPS measurements are accurate but not precise. Scenario 2 (indeterminate 
geometry) seems to be a match for our problem. Next, we show how to 
establish a change function to determine the position of the moving object in between 
samples. We initially assume precise position samples. 

Sampling Uncertainty. Capturing the position using a GPS receiver at regular time 
intervals introduces uncertainty about the position of the object in between the 
measurements. In this section, we give a model for the uncertainty introduced by the 
sampling, based on the sampling rate and the maximum speed of the object. 

The uncertainty of the representation of an object’s movement is affected by the 
sampling rate. This, in turn, may be set by considering the speed of the object and the 
desired maximum distance between consecutive samples. Let us consider the example 
of recording taxi movements. As a requirement, the distance between two consecutive 
samples should be at most 10 m. Given the maximum speed of a taxi as 150 km/h, 
we would need to sample the position at least 4.2 times per second. If a taxi moves 
slower than its maximum speed, the distance between samples is less than 10 m. 
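
The arithmetic behind the figure of 4.2 samples per second can be checked with a short sketch. The function name and parameter names are chosen for this example only.

```python
def min_sampling_rate(max_speed_kmh, max_gap_m):
    """Return the minimum number of samples per second so that an object
    moving at max_speed_kmh travels at most max_gap_m between samples."""
    speed_m_per_s = max_speed_kmh * 1000.0 / 3600.0  # 150 km/h is about 41.7 m/s
    return speed_m_per_s / max_gap_m


rate = min_sampling_rate(150, 10)  # about 4.17, i.e., roughly 4.2 per second
```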

Since we do not have positional measurements in between position samples (cf. 
Fig. 4a; the object could be anywhere in between position samples), the best we can do is to limit 
the possibilities of where the moving object could have been. Considering the trajectory 
in a time interval [t_1, t_2], delimited by consecutive samples, we know two positions, 
P_1 and P_2, as well as the object’s maximum speed, v_m (cf. Fig. 4b). If the object 
moves at maximum speed v_m from P_1 and its trajectory is a straight line, its position at 
time t_x will be on a circle of radius r_1 = v_m (t_x − t_1) around P_1 (the smaller dotted circle 
in Fig. 4b). Thus, the points on the circle represent the maximum distance of the object 
from P_1 at time t_x. If the object’s speed is lower than v_m, or its trajectory is not a 
straight line, the object’s position at time t_x will be somewhere within the area bounded 
by the circle of radius r_1. 

Similar assumptions can be made on the position of the moving object with respect 
to P_2 and t_2 to obtain a second circle of radius r_2. The constraints on the position of the 
moving object mean that the object can be anywhere within the intersection of the 
two circular areas at time t x . This intersection is shown by the shaded area in Fig. 4b. 
We use the term lens for this area of intersection. We assume a uniform distribution 




Fig. 4. (a) Possible trajectories of a moving object, (b) uncertainty between samples 



for the position within the lens, i.e., the object is equally likely anywhere within this 
lens shape. 

The sampling error at time t_x for a particular position can be described by the probability 
function of Equation 10, where r_1 and r_2 are the two radii described above, s is 
the distance between the measured positions P_1 and P_2, and A denotes the area of the 
intersection of the two circles. 

$$P_2(x, y) = \begin{cases} 1/A & \text{for } x^2 + y^2 \le r_1^2 \,\wedge\, (x - s)^2 + y^2 \le r_2^2 \\ 0 & \text{otherwise} \end{cases} \qquad (10)$$ 



To eliminate the radii in favor of the maximum speed and the times, we can substitute 
v_m (t_x − t_1) and v_m (t_2 − t_x) for r_1 and r_2, respectively. This function describes the 
position of the moving object in between position samples. Thus, this function is an 
instance of the function template as described in Scenario 1 of Table 2. 
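
The following sketch evaluates Equation 10 under the coordinate convention the equation implies, i.e., P_1 at the origin and P_2 at (s, 0). The closed-form expression used for the lens area A is the standard formula for the intersection of two circles; it is not given in the text and is supplied here as an assumption, as are all function and parameter names.

```python
import math


def lens_area(r1, r2, s):
    """Area of the intersection of two circles with radii r1, r2 whose
    centers are s apart (standard circle-circle intersection formula)."""
    if s >= r1 + r2:
        return 0.0                               # circles do not overlap
    if s <= abs(r1 - r2):
        return math.pi * min(r1, r2) ** 2        # one circle inside the other
    a1 = r1 ** 2 * math.acos((s ** 2 + r1 ** 2 - r2 ** 2) / (2 * s * r1))
    a2 = r2 ** 2 * math.acos((s ** 2 + r2 ** 2 - r1 ** 2) / (2 * s * r2))
    tri = 0.5 * math.sqrt((-s + r1 + r2) * (s + r1 - r2)
                          * (s - r1 + r2) * (s + r1 + r2))
    return a1 + a2 - tri


def sampling_density(x, y, t_x, t1, t2, s, v_m):
    """Uniform density of Equation 10 at (x, y) for time t_x in [t1, t2]."""
    r1 = v_m * (t_x - t1)                        # reachable from P_1
    r2 = v_m * (t2 - t_x)                        # can still reach P_2
    area = lens_area(r1, r2, s)
    if area == 0.0:
        return 0.0                               # degenerate lens
    inside = x * x + y * y <= r1 * r1 and (x - s) ** 2 + y * y <= r2 * r2
    return 1.0 / area if inside else 0.0
```

Since the object covers the distance s within t_2 − t_1 at a speed of at most v_m, the two circles always overlap for consistent samples; the zero-area branch only guards against degenerate input.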



Combining Error Sources - a Global Change Function. Table 2 gives a template 
of a change function that incorporates indeterminate positions. Using our example, 
this translates to adapting Equation 10 such that the values for x and y are not precise 
but affected by the measurement error. A mathematical framework suitable for this 
problem is Kalman filtering [11], which combines various error-prone measurements 
of the same fact into a single measurement with a smaller overall error. An example 
of applying Kalman filtering to the domain of vehicle navigation is the integration 
of three independent positioning systems, namely dead reckoning, map matching, 
and GPS, to determine the precise position of vehicles [12]. 
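
The idea of combining error-prone measurements into one with a smaller error can be illustrated with the scalar measurement-update step of a Kalman filter for a static quantity. This is only a minimal sketch of the general principle; it is not the combined change function of this paper, and the function name and the use of variances as error measures are assumptions.

```python
def fuse(m1, var1, m2, var2):
    """Fuse two independent noisy measurements of the same quantity.

    m1, m2     : measured values
    var1, var2 : their error variances (both positive)
    Returns the fused estimate and its variance.
    """
    gain = var1 / (var1 + var2)          # Kalman gain for the second measurement
    estimate = m1 + gain * (m2 - m1)     # weighted toward the more certain value
    variance = (1.0 - gain) * var1       # never larger than min(var1, var2)
    return estimate, variance
```

With two equally uncertain measurements, the fused estimate is their mean and the variance is halved.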



6 Conclusions and Future Work 

The work presented in this paper concerns spatial, temporal, and spatiotemporal 
indeterminacy, i.e., fuzzy and uncertain phenomena. We first show how the fundamental 
modeling concepts of spatial objects, attributes, relationships, time points, 
time periods, and events are influenced by indeterminacy. Next, we focus on the 
change of spatial objects and their geometry in time. We argue that change can occur 







on a discrete and on a continuous basis, and that it can be recorded in time points 
and time periods. By combining these concepts, we present four different change 
scenarios, which are affected by indeterminacy to varying degrees. The indeterminacy 
of change is formalized and combines the spatial and temporal concepts. Finally, 
the rather general concepts are applied to existing application areas. We discuss un- 
certainty existing in the context of moving-point-object applications. We give a 
change function to describe the position of moving objects in time, based on posi- 
tional samples. The change function is influenced by measurement errors and sam- 
pling uncertainty. 

Although mentioned, the paper does not directly discuss indeterminacy as related 
to relationships among spatial, temporal, or spatiotemporal objects. An extension of 
this work in this direction is essential. Also, the mathematical models we presented 
are concrete enough to describe and motivate indeterminacy related to the 
temporal, spatial, and spatiotemporal domain. However, to actually implement these 
concepts, more detailed mathematical formulas are needed. Finally, in a more general 
framework, this work points towards the development of spatiotemporal data types 
and data structures incorporating indeterminacy. 



References 

1. Burrough, P.A., MacMillan, R.A., and van Deursen, W.: Fuzzy Classification Methods 
for Determining Land Suitability from Soil Profile Observations and Topography. Journal 
of Soil Science, 43, pp. 193-210, 1992. 

2. Cheng, T. and Molenaar, M.: Diachronic Analysis of Fuzzy Objects. GeoInformatica 3(4), 
pp. 337-355, 1999. 

3. Chrisman, N.: A Theory of Cartographic Error and Its Measurement in Digital Databases. 
In Proceedings Auto-Carto 5, pp. 159-168, 1982. 

4. Claramunt, C., and Theriault, M.: Managing Time in GIS: An Event-Oriented Approach. 
Recent Advances in Temporal Databases, Springer-Verlag, pp. 142-161, 1995. 

5. Dyreson, C.E., Soo, M.D., and Snodgrass, R.T.: The Data Model for Time. The TSQL2 
Temporal Query Language, Kluwer Academics, pp. 97-101, 1995. 

6. Fisher, P.: Boolean and Fuzzy Regions. Geographic Objects with Indeterminate 
Boundaries, Taylor & Francis, pp. 87-94, 1996. 

7. Goodchild, M. and Gopal, S. (Eds): Accuracy of Spatial Databases. Taylor & Francis, 
1989. 

8. Güting, R., Böhlen, M., Erwig, M., Jensen, C.S., Lorentzos, N., Schneider, M., and 
Vazirgiannis, M.: A Foundation for Representing and Querying Moving Objects. ACM 
Transactions on Database Systems 25(1), pp. 1-42, 2000. 

9. Hadzilacos, T.: On Layer-Based Systems for Undetermined Boundaries. Geographic 
Objects with Indeterminate Boundaries, Taylor & Francis, pp. 237-256, 1996. 

10. Hadzilacos, T., and Tryfona, N.: A Model for Expressing Spatiotemporal Integrity 
Constraints. In Proc. of the International Conference on Theories and Methods of Spatio- 
Temporal Reasoning, pp. 252-268, 1992. 

11. Kalman, R.E.: A New Approach to Linear Filtering and Prediction Problems. Transactions 
of the ASME - Journal of Basic Engineering, pp. 35-45, 1960. 

12. Krakiwsky, E.J., Harris, C.B., and Wong, R.: A Kalman Filter for Integrating Dead Reckoning, 
Map Matching, and GPS Positioning. In Proc. of the IEEE Position Location and 
Navigation Symposium, pp. 39-46, 1988. 

13. Pfoser, D. and Jensen, C.S.: Capturing the Uncertainty of Moving-Object Representations. 
In Proc. of the 6th International Symposium on the Advances in Spatial Databases, pp. 
111-132, 1999. 

14. Pfoser, D. and Tryfona, N.: Requirements, Definitions, and Notations for Spatiotemporal 
Application Environments. In Proc. of the 6th ACM Symposium on Geographic Information 
Systems, pp. 124-130, 1998. 

15. Pfoser, D. and Tryfona, N.: Capturing Fuzziness and Uncertainty of Spatiotemporal 
Objects. TimeCenter Technical Report, 2001. 

16. Pfoser, D., Jensen, C., and Theodoridis, Y.: Novel Approaches in Query Processing for 
Moving Objects Data. In Proc. of the 27th Conference on Very Large Databases, pp. 395- 
406, 2000. 

17. Schneider, M.: Metric Operations on Fuzzy Spatial Objects in Databases. In Proc. of the 8th 
ACM Symposium on Geographic Information Systems, 2000. 

18. Shibasaki, R.: Handling Spatiotemporal Uncertainties of Geo-Objects for Dynamic Update 
of GIS Databases from Multi-Source Data. In Advanced Geographic Data Modeling, 
Netherlands Geodetic Commission, Publications on Geodesy, 40, pp. 228-243, 1994. 

19. Vazirgiannis, M.: Uncertainty Handling in Spatial Relationships. In Proc. of the ACM 
Symp. on Applied Computing, 2000. 

20. Worboys, M.: Imprecision in Finite Resolution Spatial Data. GeoInformatica 2(3), pp. 
257-279, 1998. 




Probability-Based Tile Pre-fetching and Cache 
Replacement Algorithms for Web Geographical 
Information Systems 



Yong-Kyoon Kang, Ki-Chang Kim, and Yoo-Sung Kim 



Department of Computer Science & Engineering 
INHA University, INCHEON 402-751, Korea 
yskim@inha.ac.kr 



Abstract. In this paper, an effective probability-based tile pre-fetching algorithm 
and a collaborative cache replacement algorithm for Web geographical 
information systems (Web GISs) are proposed. The proposed tile pre-fetching 
algorithm predicts in advance which tiles will be used, based on the 
global tile access pattern of all users and the semantics of the query, so that a user 
request can be answered quickly since the needed tiles are likely to be in the cache 
database. When a client runs out of cache space for newly downloaded tiles, the 
proposed cache replacement algorithm determines which tiles should be replaced 
based on their future access probabilities. By combining the proposed tile 
pre-fetching algorithm with the cache replacement algorithm, the response time 
for user requests can be improved substantially in Web GIS systems. 



1 Introduction 

With the rapid growth of computer hardware and software technologies and users’ 
demand for geographical information, geographical information systems (GISs) 
that can analyze, process, and manage geo-spatial data have been developed and have 
become very popular in several fields, e.g., civil engineering and computer engineering 
([1]). Furthermore, since the Internet and the World Wide Web (WWW) have become 
widely used, users can get geographical information at a low cost from 
Web servers that provide geographical information. These systems are referred 
to as Web GIS systems ([2, 3, 4, 5, 6]). 

Web GIS systems can be classified into server-side Web GIS and client-side 
Web GIS. In server-side Web GIS systems, since the server has to process all requests 
of all clients, the server might be overloaded, and the response time for user 
requests may become too slow ([2,7]). On the other hand, in client-side Web GIS sys- 
tems, the client loads the geographical data processing modules from the server when 
it makes a connection to the server. From then on, the client can process users’ re- 
quests by itself. Recently, client-side Web GIS systems have become very popular 
and several real systems are being developed. 

A. Caplinskas and J. Eder (Eds.): ADBIS 2001, LNCS 2151, pp. 127-140, 2001. 

© Springer- Verlag Berlin Heidelberg 2001 







Y.-K. Kang, K.-C. Kim, and Y.-S. Kim 



The granularity of transmission from the server to the client can be either a whole 
map or a tile of the map ([2]). If the granularity is a whole map, the server searches all 
spatial objects and all aspatial information on GIS database to retrieve relevant ob- 
jects and information and sends the map with the retrieved aspatial information to the 
client. However, a map may be of very large size, since a map can include a large 
number of objects. It is clear that the larger the size of a communication unit becomes, 
the more loading time is needed between the server and the client. To reduce the ini- 
tial loading time, many systems have adopted the concepts of tiling and layering. Til- 
ing divides the map into several small pieces so that each of them can be transferred 
in a short time, while layering partitions the map into several layers such that each 
layer represents some specific information. 

Tiling can minimize the initial user’s response time, but it alone can’t minimize the 
total response time. To minimize the total response time, the system should pre-fetch 
some tiles that are likely to be accessed in advance and save them in a cached data- 
base for future reusing. When the user requests the tiles that have been pre-fetched 
and saved at the cached database, the client can give these tiles to the user without the 
communication delay to fetch the required tiles from the server. 

In this paper, an efficient tile pre-fetching algorithm for Web GIS based on users’ 
global access pattern is proposed. In the proposed algorithm, the server collects and 
maintains the transition probabilities between adjacent tiles. With these probabilities, 
the server can predict which tiles are more likely than others to be accessed next, 
and by pre-fetching those recommended tiles, the client can respond to 
the user’s requests much faster. 

When the client’s cache runs out of space, the client should determine which tiles 
to replace with newly fetched tiles. Those tiles that are not likely to be accessed in the 
near future can be replaced, and the client should be able to select such tiles. We pro- 
pose a cache replacement algorithm that predicts the future usage of the tiles cor- 
rectly, based on the same access probabilities that are calculated and used for tile pre- 
fetching. The proposed cache replacement algorithm selects tiles with small transition 
probabilities from the current requested tile as candidate tiles for replacement. 

The rest of this paper is organized as follows. In Sect. 2, we discuss the architecture 
of Web GIS systems and describe query processing in them. In Sect. 3, we propose 
an efficient tile pre-fetching algorithm that can determine a set of tiles that are likely 
to be requested in the near future, based on the global tile access patterns and a cache 
replacement algorithm that can collaborate with the proposed tile pre-fetching algo- 
rithm in Web GIS systems. We also discuss an example that can show the effective- 
ness of the proposed tile pre-fetching algorithm and cache replacement algorithm. 
Finally, we conclude the paper in Sect. 4. 




Probability-Based Tile Pre-fetching and Cache Replacement Algorithms 






2 Query Processing in Web GIS Systems 



The general architecture of Web GIS systems in which the proposed tile pre-fetching 
and cache replacement algorithms are used is shown in Fig. 1 ([2]). Fig. 1 does not 
include all components of Web GIS systems; that is, it shows an abstracted 
architecture of Web GIS systems. 



Fig. 1. Abstracted architecture of Web GIS systems 



A Web GIS system mainly consists of two components: clients and a server. A client 
is a Web browser with several data processing facilities that are loaded from the server 
when the client makes a connection to the server. The server manages the GIS database, which 
consists of a spatial database and an aspatial database, and provides useful information to the 
clients when they submit users’ requests. As we discussed in the previous section, 
the server manages the spatial information in units of tiles. That is, a map is decomposed 
into a set of tiles that can be transferred to the client in a short time. 






In Web GIS systems, users’ requests are classified into two categories: browsing 
commands and GIS queries. GIS queries include point queries, region queries, and 
object retrieval queries with selection predicates. For a point query, the user 
gives the coordinates (a, b) of a point in a map, and the client serves the tile that contains 
the given point. For a region query, the user specifies a region in rectangular 
form with (a_1, b_1) and (a_2, b_2), or in circular form with the center point (a, b) and a 
radius, and the client returns the set of tiles that covers the given region. For an object 
retrieval query with selection predicates, the client returns the objects that satisfy the 
given predicates on the map. Usually, a GIS query is used as the first request when a 
user starts a session with the Web GIS system. That is, the user first submits a GIS 
query and, based on the result of this first query, submits a sequence of browsing 
commands and/or GIS queries. Browsing commands include zooming and moving commands. 
For a zoom-in command, by which the user can see the current position (a, b) of the 
map in more detail, the client does not have to contact the server if the required data 
has been cached at the client. Otherwise, the client has to download more detailed 
map information from the server. For a zoom-out command, by which the user can 
view a wider area of the map, the client should fetch the neighbor tiles of the current one 
from the server. By using a moving command, the user can move to one of the 
4 neighbor tiles of the current tile, in the direction up, down, left, or right. 

The formats of users’ requests in Web GIS systems are as follows. 

Point_Query (a, b) 
Rectangle_Region_Query (a1, b1, a2, b2) 
Circle_Region_Query (a, b, radius) 
Object_Retrieval_Query (selection_predicates) 
Zoom-in (a, b, smaller_radius) 
Zoom-out (a, b, larger_radius) 
Moving (a, b, direction) 
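
As a rough illustration of how a point query and a moving command might be mapped to tile indices, consider the following sketch. It assumes square tiles of a fixed size on a regular grid, with tile (x, y) covering the coordinates from (x * tile_size, y * tile_size) upward, and it assumes that "up" increases y; none of these conventions is specified in the text.

```python
import math

# Assumed mapping from moving directions to tile index offsets.
DIRECTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}


def tile_of_point(a, b, tile_size):
    """Return the index (x, y) of the tile containing point (a, b)."""
    return (math.floor(a / tile_size), math.floor(b / tile_size))


def moving(a, b, direction, tile_size):
    """Return the index of the neighbor tile reached by a moving command
    issued at position (a, b)."""
    x, y = tile_of_point(a, b, tile_size)
    dx, dy = DIRECTIONS[direction]
    return (x + dx, y + dy)
```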

In Fig. 1, a user’s request is processed as follows. When a user submits a request 
to the client, the Query Analyzer and Executor (QAE) analyzes the user’s request and 
executes it. To execute the request, QAE asks the Cache Manager (CM) for the data 
needed to process it. If CM can find the data in 
the cached database, it returns the data without contacting the server. Otherwise, 
CM sends the request to the Pre-fetch Agent (PA) to retrieve the necessary data from the 
server. PA basically passes the user’s request to the Pre-fetch Executor (PE) at the server. 
However, to give the server the information needed for tile pre-fetching, PA adds 
some additional information to the original request. That is, in addition to the 






original request, PA also gives the list of tiles that have been cached, the transition 
frequencies among the cached tiles since the last connection to the server, and the 
number of tiles it wants to pre-fetch. The pre-fetching size can be determined on the 
basis of the size of free space for tile pre-fetching, and the regularity degree of access 
pattern of the user at the client. 

When PE receives a modified request from the PA, PE decomposes the request into 
the original request and the augmented information. It sends the original request to 
Search Engine (SE) to retrieve the result of the user’s request and performs pre- 
fetching based on the retrieved result and the augmented information. To properly 
pre-fetch tiles that are likely to be accessed by the user in the near future, the pre- 
fetching algorithm uses the transition probabilities between tiles, and the details on 
the pre-fetching algorithm will be discussed in Sect. 3.1. 
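
The exchange between PA and PE described above can be pictured with a small data structure. The sketch below is purely illustrative: the class name, field names, and the string encoding of the original request are assumptions, since the text does not specify a message format.

```python
from dataclasses import dataclass, field


@dataclass
class AugmentedRequest:
    """Hypothetical message sent from PA (client) to PE (server)."""
    original_request: str                                   # e.g. "Point_Query (a, b)"
    cached_tiles: list = field(default_factory=list)        # tiles already cached
    transition_counts: dict = field(default_factory=dict)   # (src, dst) -> frequency
    prefetch_size: int = 0                                  # tiles the client wants


def decompose(req):
    """Split a request into the original request and the augmented part,
    mirroring what PE does on receipt."""
    augmented = {
        "cached_tiles": list(req.cached_tiles),
        "transition_counts": dict(req.transition_counts),
        "prefetch_size": req.prefetch_size,
    }
    return req.original_request, augmented
```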

PE sends the pre-fetched tiles together with the tiles retrieved for the user’s request to 
CM through PA. When CM receives more tiles than its storage capacity allows, CM 
determines which cached tiles will be replaced with the newly received ones by using the 
cache replacement algorithm described in Sect. 3.2. 

When QAE receives the retrieved result, QAE executes the user’s request on the 
retrieved data, and the final result is shown to the user’s browser. 



3 Tile Pre-fetching and Cache Replacement in Web GIS Systems 

3.1 A Probability-Based Tile Pre-fetching Algorithm 

PE determines which tiles should be pre-fetched based on the updated global access 
pattern information according to Algorithm 1. To determine the tiles to be pre-fetched, 
PE first updates the global access pattern by using the local access pattern sent from 
PA (step 1). If the number of tiles returned as the result of the request is greater than 
pre-fetch_size, we do not need to pre-fetch, since the request has already retrieved more tiles 
than the expectation specified by pre-fetch_size (step 2). Otherwise, we calculate the 
normalized probabilities from T_{x,y} to its 4 neighbor tiles, which are at distance = 1 
(step 3). If pre-fetch_size is greater than 1, we calculate the transition probabilities 
to the tiles located within distance ≤ pre-fetch_size (steps 4-5). At step 6, we sort the 
probabilities in descending order. Then we select the top-ranked pre-fetch_size tiles from 
the pre-fetching space (step 7). At step 8, we eliminate the tiles that have already been 
cached at the client by previous requests. The list of tiles to be pre-fetched is returned 
as the result of Algorithm 1. Algorithm 1 also returns own_tile_list 
with the updated transition probabilities, and the result is sent to the CM of the client for 
cache replacement. 







Algorithm 1: Pre-fetching Algorithm Based on Global Access Pattern 

Input: pre-fetch_size, own_tile_list, local_access_pattern, return_tiles including 
central tile T_{x,y} with specified point (a, b) 

Output: list of tiles to be pre-fetched, with transition probabilities, for pre-fetching; 
own_tile_list with the updated transition probabilities for cache replacement 

Data Structure: transition probability matrix 

1: Update the global access pattern by using local_access_pattern; 
2: IF (number of return_tiles > pre-fetch_size) RETURN (NO_PRE-FETCH); 
3: Compute the normalized probabilities from T_{x,y} to its 4 neighbors; /* distance = 1 */ 
4: FOR each tile within distance from 2 to pre-fetch_size DO /* distance ≥ 2 */ 
5:     Compute the conditional probability of tile moving from T_{x,y} to the tile; 
6: Sort the probabilities of tiles within the pre-fetching space of distance ≤ pre-fetch_size; 
7: Let pre-fetch_list = select top-ranked pre-fetch_size tiles within the pre-fetching space; 
8: Let pre-fetch_list = pre-fetch_list - own_tile_list; 
9: Reset the transition probabilities of all tiles in { own_tile_list - pre-fetch_list } to 0; 
10: RETURN (pre-fetch_list and own_tile_list with the updated transition probabilities); 
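
For illustration only, steps 2 to 10 of Algorithm 1 could be sketched in Python as follows. Several points are assumptions rather than part of the algorithm as stated: the probability of reaching a tile at distance ≥ 2 is taken to be the sum, over shortest paths, of the products of the one-step transition probabilities; the one-step probabilities are assumed to be already normalized; step 1 (updating the global access pattern) is omitted; and step 9 is read as assigning 0 to cached tiles that are not among the top-ranked tiles. All names are hypothetical.

```python
# Offsets of the 4 neighbor tiles (assumed order: up, right, down, left).
NEIGHBOR_OFFSETS = [(0, 1), (1, 0), (0, -1), (-1, 0)]


def prefetch(center, trans_prob, prefetch_size, own_tiles, n_returned=1):
    """Sketch of Algorithm 1.

    center     : (x, y) index of the retrieved tile T_{x,y}
    trans_prob : dict mapping (src_tile, dst_tile) -> one-step probability
    own_tiles  : set of tiles already cached at the client
    Returns None for NO_PRE-FETCH, else (prefetch_list, own_probs).
    """
    if n_returned > prefetch_size:                       # step 2
        return None
    reach = {center: 1.0}
    frontier = {center: 1.0}
    for _ in range(prefetch_size):                       # steps 3-5
        next_frontier = {}
        for (x, y), p in frontier.items():
            for dx, dy in NEIGHBOR_OFFSETS:
                dst = (x + dx, y + dy)
                if dst in reach:                         # closer tile, skip
                    continue
                step = trans_prob.get(((x, y), dst), 0.0)
                next_frontier[dst] = next_frontier.get(dst, 0.0) + p * step
        reach.update(next_frontier)
        frontier = next_frontier
    del reach[center]
    ranked = sorted(reach.items(), key=lambda kv: (-kv[1], kv[0]))   # step 6
    top = ranked[:prefetch_size]                                     # step 7
    prefetch_list = [(t, p) for t, p in top if t not in own_tiles]   # step 8
    top_probs = dict(top)
    own_probs = {t: top_probs.get(t, 0.0) for t in own_tiles}        # step 9
    return prefetch_list, own_probs                                  # step 10
```

The breadth-first expansion visits exactly the tiles within pre-fetch_size tile movements of the center, which corresponds to the pre-fetching space used in the example that follows.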



As an example, we consider a pre-fetching request that returns tile T_{x,y} (see footnote 1) as the 
retrieval result. If pre-fetch_size is 0, Algorithm 1 quits with NO_PRE-FETCH at step 2 
after updating the global access pattern by using local_access_pattern. Otherwise, 
i.e., if pre-fetch_size is greater than 0, step 3 of Algorithm 1 is executed. Assume that 
pre-fetch_size is 3, which means PA wants to pre-fetch up to 3 tiles. PE forms the 
pre-fetching space with the distance equal to 3. The pre-fetching space includes tiles that 



1 In general, Rectangle_Region_Query, Circle_Region_Query, Object_Retrieval_Query, and 
Zoom-out commands might return more than one tile as the retrieved result. For simplicity, 
however, we assume that a single tile, T_{x,y}, is retrieved. 






Probability-Based Tile Pre-fetching and Cache Replacement Algorithms 






can be accessed within the specified number of tile movements from the retrieved tile, T_{x,y}. Fig. 2 shows an example of the pre-fetching space from T_{x,y} with pre-fetch_size = 3. Within the pre-fetching space, the four immediate neighbor tiles of T_{x,y}, namely T_{x,y+1}, T_{x+1,y}, T_{x,y-1}, and T_{x-1,y}, can be reached by 1 tile movement from T_{x,y}. Tiles T_{x,y+2}, T_{x+1,y+1}, T_{x+2,y}, T_{x+1,y-1}, T_{x,y-2}, T_{x-1,y-1}, T_{x-2,y}, and T_{x-1,y+1} can be reached by 2 tile movements from T_{x,y}. Finally, tiles T_{x,y+3}, T_{x+1,y+2}, T_{x+2,y+1}, T_{x+3,y}, T_{x+2,y-1}, T_{x+1,y-2}, T_{x,y-3}, T_{x-1,y-2}, T_{x-2,y-1}, T_{x-3,y}, T_{x-2,y+1}, and T_{x-1,y+2} can be reached by 3 tile movements from T_{x,y}. In Fig. 2, the edge from a tile to a neighbor tile stands for a tile movement, and the label of the edge is the probability of that transition. That is, P(x,y → x,y+1) stands for the probability of tile moving from T_{x,y} to T_{x,y+1}.

For tiles that can be reached by 1 tile movement from T_{x,y}, we compute the normalized probabilities. The normalization is necessary because the specified position (a, b) in T_{x,y} affects the next tile movement. That is, if the specified point (a, b) is near the upper border, then the user who has specified the point is more likely to move to the upper tile than to the lower one.



To explain the normalization process, consider the situation depicted in Fig. 3. The original probabilities of tile moving from T_{x,y} to T_{x,y+1}, T_{x+1,y}, T_{x,y-1}, and T_{x-1,y} are P(x,y → x,y+1), P(x,y → x+1,y), P(x,y → x,y-1), and P(x,y → x-1,y), respectively. The location specified by the user is (a, b). We denote the normalized transition probabilities with distance 1 from T_{x,y} by P'(x,y → x,y+1), P'(x,y → x+1,y), P'(x,y → x,y-1), and P'(x,y → x-1,y), respectively. Note that the sum of the normalized probabilities P'(x,y → x+1,y) and P'(x,y → x-1,y) along the x axis should be the same as the sum of the original probabilities P(x,y → x+1,y) and P(x,y → x-1,y), and that they should reflect the internal division ratio of the specified position along the x axis. Equations (1) and (2) show the formulas for P'(x,y → x+1,y) and P'(x,y → x-1,y), respectively. In equations (1) and (2), for simplicity, we use P_right and P_left instead of P(x,y → x+1,y) and P(x,y → x-1,y), respectively. A similar argument can be made for P'(x,y → x,y+1) and P'(x,y → x,y-1), and the resulting formulas are equations (3) and (4), respectively. In equations (3) and (4), again for simplicity, we use P_up and P_down instead of P(x,y → x,y+1) and P(x,y → x,y-1), respectively.





Fig. 3. Normalized probabilities from T_{x,y} to its 4 neighbor tiles






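$$P'_{right} = \frac{P_{right} + P_{left}}{l_{x1} P_{right} + l_{x2} P_{left}} \times l_{x1} P_{right} \tag{1}$$

$$P'_{left} = \frac{P_{right} + P_{left}}{l_{x1} P_{right} + l_{x2} P_{left}} \times l_{x2} P_{left} \tag{2}$$

$$P'_{up} = \frac{P_{up} + P_{down}}{l_{y2} P_{up} + l_{y1} P_{down}} \times l_{y2} P_{up} \tag{3}$$

$$P'_{down} = \frac{P_{up} + P_{down}}{l_{y2} P_{up} + l_{y1} P_{down}} \times l_{y1} P_{down} \tag{4}$$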
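The normalization of equations (1) to (4) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function and argument names are hypothetical, the sample probabilities (0.25 for every neighbor) are invented, and the sketch assumes that l_y2 weights P_up and l_y1 weights P_down, mirroring the way l_x1 weights P_right and l_x2 weights P_left.

```python
def normalize_neighbor_probabilities(p_right, p_left, p_up, p_down,
                                     l_x1, l_x2, l_y1, l_y2):
    """Return (P'_right, P'_left, P'_up, P'_down) for the four neighbors."""

    def split(p_a, p_b, w_a, w_b):
        # Redistribute p_a + p_b in proportion to w_a * p_a : w_b * p_b,
        # so that the sum along the axis is preserved.
        denominator = w_a * p_a + w_b * p_b
        if denominator == 0:
            return p_a, p_b  # nothing to redistribute
        scale = (p_a + p_b) / denominator
        return scale * w_a * p_a, scale * w_b * p_b

    n_right, n_left = split(p_right, p_left, l_x1, l_x2)
    # Assumption: l_y2 is the weight of the upward move, l_y1 of the downward one.
    n_up, n_down = split(p_up, p_down, l_y2, l_y1)
    return n_right, n_left, n_up, n_down


# Hypothetical input: equal original probabilities, point (a, b) dividing the
# tile 6:4 horizontally (l_x1:l_x2) and 1:9 vertically (l_y1:l_y2).
normalized = normalize_neighbor_probabilities(0.25, 0.25, 0.25, 0.25,
                                              6, 4, 1, 9)
print(normalized)
```

With these inputs the sums along each axis stay at 0.5, while the right and upper neighbors receive the larger shares.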



To compute the probability of tile moving from T_{x,y} to a tile at distance 2, we can use the conditional probability computation. As an example, consider how to compute the conditional probability of tile moving from T_{x,y} to the tile T_{x,y+2}. From Fig. 2, we can see that T_{x,y+2} can be reached by two tile movements from T_{x,y} through T_{x,y+1}. That is, we can reach T_{x,y+2} from T_{x,y} by moving first to T_{x,y+1} and then to T_{x,y+2}. Thus, we can compute the conditional probability of tile moving from T_{x,y} to T_{x,y+2} through T_{x,y+1} by equation (5).

$$P(x,y \to x,y+2) = P'(x,y \to x,y+1) \times P(x,y+1 \to x,y+2) \tag{5}$$

Some tiles can be reached in several ways. For example, T_{x+1,y+1} can be reached from T_{x,y} by two different paths (see Fig. 2): one is T_{x,y} → T_{x,y+1} → T_{x+1,y+1}, and the other is T_{x,y} → T_{x+1,y} → T_{x+1,y+1}. In this case, the conditional probability of tile moving from T_{x,y} to T_{x+1,y+1} can be computed as follows.

$$P(x,y \to x+1,y+1) = P'(x,y \to x,y+1) \times P(x,y+1 \to x+1,y+1) + P'(x,y \to x+1,y) \times P(x+1,y \to x+1,y+1) \tag{6}$$

We can do a similar computation for the tiles at distance 3. The conditional probability of tile moving from T_{x,y} to T_{x,y+3} is computed by using equation (7) (see also Fig. 2).







$$P(x,y \to x,y+3) = P(x,y \to x,y+2) \times P(x,y+2 \to x,y+3) \tag{7}$$

Also, for T_{x+1,y+2} in Fig. 2, the conditional probability of tile moving from T_{x,y} to T_{x+1,y+2} is computed by using equation (8).

$$P(x,y \to x+1,y+2) = P(x,y \to x,y+2) \times P(x,y+2 \to x+1,y+2) + P(x,y \to x+1,y+1) \times P(x+1,y+1 \to x+1,y+2) \tag{8}$$

Generally, the conditional probability of tile moving from T_{x,y} to T_{x+n,y+m}, where the distance between the two tiles is |n| + |m|, can be computed as in equation (9).

$$P(x,y \to x+n,y+m) = \sum (\text{conditional probabilities of all paths from } T_{x,y} \text{ to } T_{x+n,y+m}) \tag{9}$$
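The path sums of equations (5) to (9) can be evaluated ring by ring, as the following Python sketch suggests. It is a minimal illustration, not the authors' code: the function names, the dictionary-based inputs, and the uniform transition probability of 0.25 are assumptions, and the sketch assumes that only movements leading away from the retrieved tile are followed, so a tile at distance d is reached by paths of exactly d movements.

```python
from collections import defaultdict

STEPS = ((0, 1), (1, 0), (0, -1), (-1, 0))  # up, right, down, left


def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def reach_probabilities(origin, first_step, transition, max_distance):
    """Sum the path probabilities to every tile within max_distance.

    first_step: normalized probabilities P' keyed by neighbor tile.
    transition: function (src, dst) -> original probability P(src -> dst).
    """
    result = dict(first_step)
    frontier = dict(first_step)
    for distance in range(2, max_distance + 1):
        next_frontier = defaultdict(float)
        for src, p_src in frontier.items():
            for dx, dy in STEPS:
                dst = (src[0] + dx, src[1] + dy)
                # Follow only movements that increase the distance from origin.
                if manhattan(origin, dst) == distance:
                    next_frontier[dst] += p_src * transition(src, dst)
        frontier = dict(next_frontier)
        result.update(frontier)
    return result


# Hypothetical inputs: P' values for the four neighbors of tile (0, 0) and a
# uniform original transition probability of 0.25 between neighboring tiles.
first = {(0, 1): 0.45, (1, 0): 0.30, (0, -1): 0.05, (-1, 0): 0.20}
probs = reach_probabilities((0, 0), first, lambda src, dst: 0.25, 3)
print(probs[(0, 2)], probs[(1, 1)], probs[(1, 2)])
```

Here `probs[(1, 1)]` adds the two paths of equation (6), and `probs[(1, 2)]` adds the two terms of equation (8).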

After computing the conditional probabilities of all tiles within distance ≤ pre-fetch_size from the retrieved tile, T_{x,y}, at steps 4 and 5 of Algorithm 1, the list of tiles that should be pre-fetched is selected according to steps 6 to 8. Step 9 of Algorithm 1 then deals with the tiles that are saved in the client's cache database but will not be used in the near future: it resets the transition probabilities of these tiles to 0, so that their cache space can be freed for the current request. The cache space of these tiles can then be reused when CM needs more space for the newly fetched tiles. As the result of Algorithm 1, the list of tiles to be pre-fetched and the own_tile_list with the updated transition probabilities are returned.
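Steps 6 to 8 of Algorithm 1 amount to a sort, a top-k selection, and a set difference. The Python sketch below shows one possible reading; the function name and the sample probabilities are hypothetical, and tie-breaking between equal probabilities (here: input order) is an assumption that Algorithm 1 does not specify.

```python
def select_prefetch_list(probabilities, prefetch_size, own_tile_list):
    """Return the tiles to pre-fetch as (tile, probability) pairs."""
    # Step 6: sort the tiles of the pre-fetching space by probability.
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    # Step 7: keep the top-ranked pre-fetch_size tiles.
    top = ranked[:prefetch_size]
    # Step 8: drop tiles the client already owns.
    return [(tile, p) for tile, p in top if tile not in own_tile_list]


# Hypothetical probabilities for part of a pre-fetching space around (0, 0).
space = {(0, 1): 0.45, (1, 0): 0.30, (1, 1): 0.1875, (0, 2): 0.1125}
print(select_prefetch_list(space, 2, own_tile_list={(1, 0)}))
```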

After PE pre-fetches the tiles of pre-fetch_list, it returns the tiles actually retrieved for the request, the pre-fetched tiles with their transition probabilities, and the own_tile_list with the updated transition probabilities to CM through PA of the client issuing the request.



3.2 A Collaborative Cache Replacement Algorithm 

When CM of a client receives the result described above for a user request from PE, 
CM stores both the retrieved tiles and the pre-fetched tiles in the cache. However, 
when it runs out of the cache space, it should remove some tiles to prepare free space 
for newly fetched tiles. To determine which tiles should be removed, the proposed 
cache replacement algorithm (Algorithm 2) utilizes the transition probabilities for 
tiles in own_tile_list, which are already computed for the purpose of tile pre-fetching 
(Algorithm 1). Algorithm 2 selects a set of tiles that have relatively smaller values of 







transition probability from the current position than others among own_tile_list as the 
victim for cache replacement. 



Algorithm 2: Cache Replacement Algorithm

Input: retrieved result tiles, pre-fetched tiles, and own_tile_list with the transition probabilities

Data structure: list of cached tiles, size of free cache space

1: victim_tile_list = NULL;
2: required space = retrieved result tiles + pre-fetched tiles;
3: WHILE (size_of(required space) > size of free cache space) DO {
4:     select tile T_{i,j} that has the minimum transition probability from own_tile_list;
5:     victim_tile_list += { T_{i,j} };
6:     own_tile_list -= { T_{i,j} };
7:     size of free cache space += size_of(T_{i,j}); } /* for making enough space */
8: remove tiles in victim_tile_list from the cached database;
9: list of cached tiles -= victim_tile_list;
10: save retrieved result tiles and pre-fetched tiles into the cached database;
11: list of cached tiles += (retrieved result tiles + pre-fetched tiles);
12: size of free cache space -= size_of(retrieved result tiles + pre-fetched tiles);
13: RETURN;



In Algorithm 2, steps 3 through 7 select the set of tiles that should be removed from the cached database to make enough free space for storing the retrieved result tiles and the pre-fetched tiles. The selected tiles are then removed from the cached database (step 8) and from the list of cached tiles (step 9). At step 10, the retrieved tiles and the pre-fetched tiles are stored in the cached database. After cache replacement, the list of actual cached tiles and the size of free cache space are updated at steps 11 and 12, respectively.
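The victim selection of Algorithm 2 can be sketched in Python as below. The sketch is an illustration only: it assumes that all tiles have the same size (so capacity is counted in tiles), that own_tile_list coincides with the cached tiles, and that the loop simply stops when no candidate is left. The function name and the transition probabilities in the usage example are hypothetical.

```python
def replace_cache(cached_tiles, own_tile_probs, incoming_tiles, capacity):
    """Evict the least probable tiles until the incoming tiles fit.

    Returns the new set of cached tiles and the list of victims.
    """
    cached = set(cached_tiles)
    candidates = dict(own_tile_probs)
    new_tiles = [t for t in incoming_tiles if t not in cached]
    free = capacity - len(cached)
    victims = []
    # Steps 3-7: pick minimum-probability tiles until there is enough space.
    while len(new_tiles) > free and candidates:
        victim = min(candidates, key=candidates.get)
        del candidates[victim]
        victims.append(victim)
        free += 1
    # Steps 8-12: drop the victims and store the new tiles that fit.
    cached -= set(victims)
    cached |= set(new_tiles[:free])
    return cached, victims


# Tiles are written as (dx, dy) offsets from the retrieved tile T_{x,y}.
own = {(0, -1): 0.05, (1, -1): 0.03, (0, -2): 0.01, (-1, -1): 0.02}
cache, victims = replace_cache(own.keys(), own,
                               [(0, 0), (0, 1), (0, 2)], capacity=5)
print(sorted(cache), victims)
```

With a capacity of 5 tiles and 4 tiles already cached, storing 3 incoming tiles requires two evictions, and the two candidates with the smallest probabilities are chosen.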



3.3 Effects of the Proposed Tile Pre-fetching and Cache Replacement Algorithms 

To show the effectiveness of the proposed tile pre-fetching and cache replacement algorithms, we discuss an example in this subsection. Assume that all tiles are the same size, that the cached database can store at most 5 tiles, and that PA submits the following query to PE.









Point_Query(a, b, pre-fetch_size(=2)) with
own_tile_list(={(x,y-1), (x+1,y-1), (x,y-2), (x-1,y-1)}),
local_access_pattern(=NULL).

As the result of the above query, assume that tile T_{x,y} is returned. The specified point (a, b) internally divides the horizontal length of T_{x,y} in the ratio 6:4 for l_{x1}:l_{x2} of Fig. 3, and the vertical length of T_{x,y} in the ratio 1:9 for l_{y1}:l_{y2} of Fig. 3. We also assume that the original probabilities of tile moving from a tile to its neighbor tiles are as shown in Fig. 4.



Fig. 4. A pre-fetching space within distance ≤ 2 from T_{x,y}

According to Algorithm 1, the global access pattern is first updated by using the local access pattern. However, since the local access pattern is NULL, the global access pattern does not change. Since the number of tiles returned as the result of the query is






1, and pre-fetch_size (=2) is greater than the number of returned tiles (=1), step 3 of Algorithm 1 computes the normalized probabilities of tile moving from T_{x,y} to its 4 neighbor tiles by using equations (1), (2), (3), and (4). The normalized probabilities are shown within parentheses at the four nodes of distance = 1 in Fig. 4. The conditional probabilities for the tiles of distance = 2 are computed by using either equation (5) or equation (6), according to the number of incoming branches into the node in Fig. 4, and the computed results are also shown within parentheses at each node of distance = 2 in Fig. 4.

Among the nodes in the pre-fetching space of distance ≤ pre-fetch_size (=2), the top-ranked 2 tiles are selected for the pre-fetching list. Hence, T_{x,y+1} and T_{x,y+2} are selected for actual pre-fetching. The final result for the above Point_Query, therefore, is T_{x,y}, T_{x,y+1}, and T_{x,y+2}. T_{x,y} is transferred to the client as the actual retrieved result of the query, and T_{x,y+1} and T_{x,y+2} are also transferred to the client as the pre-fetching results.

When CM of the client issuing the query receives the result, T_{x,y}, T_{x,y+1}, and T_{x,y+2}, it runs out of free space for caching because 4 tiles, T_{x,y-1}, T_{x+1,y-1}, T_{x,y-2}, and T_{x-1,y-1}, are already stored in its cached database, which can hold at most 5 tiles; i.e., the size of free cache space is 1. Hence, as the victims for cache replacement, Algorithm 2 selects T_{x,y-2} and T_{x-1,y-1} from own_tile_list, since these have smaller transition probabilities than the others. Finally, the cached database holds T_{x,y}, T_{x,y+1}, T_{x,y+2}, T_{x,y-1}, and T_{x+1,y-1} after the complete execution of the above Point_Query.

In that case, as long as the user moves around these tiles, no communication between the client and the server is needed, since the client can serve these user requests without downloading additional tiles from the server. However, if these tiles had not been pre-fetched, the client would have to download them, and the user would have to wait until they are completely fetched into the cached database. So, by using the proposed pre-fetching algorithm and the collaborative cache replacement algorithm, the response time can be remarkably improved in Web GIS systems.



4 Conclusions 



In this paper, we have proposed an effective tile pre-fetching algorithm that is able to determine which tiles are likely to be accessed in the near future according to the global access pattern of all users in Web geographical information systems (Web GISs). We have also proposed a collaborative cache replacement algorithm that works with the proposed tile pre-fetching algorithm. The proposed cache replacement algorithm determines which tiles should be removed from the client's cached database based on the transition probabilities already computed for tile pre-fetching. We have modified the architecture of Web GIS systems to accommodate the







proposed pre-fetching algorithm with the collaborative cache replacement algorithm and showed, by means of an example, that the proposed algorithms can improve the response time substantially.

As future work, we are conducting simulation experiments to evaluate the performance of the proposed pre-fetching and cache replacement algorithms. We also plan to adapt the proposed algorithms to a Web GIS engine.

Acknowledgement. This work was supported by a grant from INHA University.



References 



1. R. Laurini and D. Thompson, "Fundamentals of Spatial Information Systems", Academic Press, 1992.

2. Young-Sub Cho, "A Client-side Web GIS Using Tiling Storage Structure and Hybrid Spatial Query Processing Strategy", Ph.D. Thesis, Dept. of Computer Science and Engineering, INHA University, 1999.

3. Edward P. F. Chan and Koji Ueda, "Efficient Query Result Retrieval over the Web", Proceedings of the 7th International Conference on Parallel and Distributed Systems (ICPADS 00), July 2000, pp. 161-170.

4. K. E. Foote and A. P. Kirvan, "WebGIS, NCGIA Core Curriculum in GIScience", http://www.ncgia.uscb.edu/giscc/units/ul33/ul33.html, December 1997.

5. Serena Coetzee and Judith Bishop, "A New Way to Query GISs on the Web", IEEE Software, May/June 1998.

6. M. V. Liedekerke, A. Jones, and G. Graziani, "The European Tracer Experiment Information System: Where GIS and WWW Meet", Proceedings of the 1995 ESRI User Conference, http://www.esri.com/library/userconf/proce95/to050/p022.html

7. Y. K. Choo and C. Lee, "Integrated Distributed Geographical Information System (IDGIS)", Proceedings of the 1997 ESRI User Conference, http://www.esri.com/library/userconf/proc97/TO150/PAP101/P101.HTM

8. D. J. DeWitt, N. Kabra, J. Luo, J. M. Patel, and J. B. Yu, "Client-Server Paradise", Proceedings of the 20th VLDB Conference, 1994.

9. M. Carey, M. Franklin, M. Livny, and E. Shekita, "Data Caching Tradeoffs in Client-Server DBMS Architecture", Proceedings of the ACM SIGMOD, Vol. 20, 1991.




Optimizing Pattern Queries for Web Access Logs*



Tadeusz Morzy, Marek Wojciechowski, and Maciej Zakrzewicz 

Poznan University of Technology 
Institute of Computing Science 
ul. Piotrowo 3a, 60-965 Poznan, Poland 
Tadeusz.Morzy@put.poznan.pl
Marek.Wojciechowski@cs.put.poznan.pl
Maciej.Zakrzewicz@cs.put.poznan.pl



Abstract. Web access logs, usually stored in relational databases, are 
commonly used for various data mining and data analysis tasks. The 
tasks typically consist in searching the web access logs for event sequences 
that support a given sequential pattern. For large data volumes, this 
type of searching is extremely time consuming and is not well optimized 
by traditional indexing techniques. In this paper we present a new index 
structure to optimize pattern search queries on web access logs. We focus 
on its physical structure, maintenance and performance issues. 



1 Introduction 

Web access logs represent the history (the sequences) of users' visits to a web server [11]. Log entries are collected automatically and can be used by administrators for web usage analysis [4] [6] [7] [16] [17] [18] [20]. Usually, after some frequently occurring sequential patterns are discovered [1], the logs are searched for access sequences that contain (support) the discovered sequential patterns. We will refer to this type of searching as pattern queries.

An example web access log is shown in Fig. 1. For each client's request we store the client's IP address, the timestamp, and the URL address of the requested object. In general, several requests from the same client may have identical timestamps since they can represent components of a single web page (e.g. attached images). In most cases, web access logs are stored in relational, SQL-accessed databases. Let us consider the following example of using the relational approach to pattern queries. Assume that the relation R(IP, TS, URL) stores web access sequences. Each tuple contains the sequence identifier (IP), the timestamp (TS), and the item (URL). Our example relation R describes three web access sequences: {A,B} → {C} → {D}, {A} → {E,C} → {F}, and {B,C,D} → {A}. Let the searched sequential pattern (subsequence) be: {A} → {E} → {F}. We are looking for all the web access sequences that contain the given sequential

* This work was partially supported by the grant no. KBN 43-1309 from the State 
Committee for Scientific Research (KBN), Poland. 



A. Caplinskas and J. Eder (Eds.): ADBIS 2001, LNCS 2151, pp. 141—154, 2001. 
(c) Springer- Verlag Berlin Heidelberg 2001 







T. Morzy, M. Wojciechowski, and M. Zakrzewicz 



pattern. Fig. 2 gives the relation R and the SQL query, which implements the 
pattern query. 



154.11.231.17 - - [13/Jul/2000:20:42:25 +0200] "GET / HTTP/1.1" 200 1673
154.11.231.17 - - [13/Jul/2000:20:42:25 +0200] "GET /apachepb.gif HTTP/1.1" 200 2326
192.168.1.25 - - [13/Jul/2000:20:42:25 +0200] "GET /demo.html HTTP/1.1" 200 520
192.168.1.25 - - [13/Jul/2000:20:44:45 +0200] "GET /books.html HTTP/1.1" 200 3402
160.81.77.20 - - [13/Jul/2000:20:42:25 +0200] "GET / HTTP/1.1" 200 1673
154.11.231.17 - - [13/Jul/2000:20:42:25 +0200] "GET /car.html HTTP/1.1" 200 2580
192.168.1.25 - - [13/Jul/2000:20:49:50 +0200] "GET /cdisk.html HTTP/1.1" 200 3856
10.111.62.101 - - [13/Jul/2000:20:42:25 +0200] "GET /new/demo.html HTTP/1.1" 200 971

192.168.1.25: /demo.html → /books.html → /cdisk.html



Fig. 1. Web access log example and a web access sequence 



Relation R:

IP  TS  URL
1   1   A
1   1   B
1   2   C
1   3   D
2   1   A
2   2   E
2   2   C
2   3   F
3   1   B
3   1   C
3   1   D
3   2   A

Pattern query:

SELECT R1.IP
FROM   R R1, R R2, R R3
WHERE  R1.IP = R2.IP
AND    R2.IP = R3.IP
AND    R1.TS < R2.TS
AND    R2.TS < R3.TS
AND    R1.URL = 'A'
AND    R2.URL = 'E'
AND    R3.URL = 'F';



Fig. 2. The relation of web access sequences and the pattern query 
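The pattern query of Fig. 2 can be tried out on an in-memory SQLite database with Python's standard `sqlite3` module, as sketched below. The choice of SQLite, the column types, and the variable names are assumptions made for this illustration; the sketch also qualifies the selected column as R1.IP and adds DISTINCT so that each matching sequence is reported once.

```python
import sqlite3

# Tuples (IP, TS, URL) of the example relation R.
rows = [
    (1, 1, "A"), (1, 1, "B"), (1, 2, "C"), (1, 3, "D"),
    (2, 1, "A"), (2, 2, "E"), (2, 2, "C"), (2, 3, "F"),
    (3, 1, "B"), (3, 1, "C"), (3, 1, "D"), (3, 2, "A"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (IP INTEGER, TS INTEGER, URL TEXT)")
conn.executemany("INSERT INTO R VALUES (?, ?, ?)", rows)

# Three-way self-join looking for the pattern {A} -> {E} -> {F}.
query = """
SELECT DISTINCT R1.IP
FROM R R1, R R2, R R3
WHERE R1.IP = R2.IP AND R2.IP = R3.IP
  AND R1.TS < R2.TS AND R2.TS < R3.TS
  AND R1.URL = 'A' AND R2.URL = 'E' AND R3.URL = 'F'
"""
matching = [ip for (ip,) in conn.execute(query)]
print(matching)  # [2]
```

Only the sequence with identifier 2, {A} → {E,C} → {F}, supports the pattern.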

Since web access logs tend to be very large, the database access must be appropriately optimized when performing pattern queries, e.g. by means of the above SQL query. Database research has developed many indexing techniques, such as B+-trees [5], bitmapped indexes [15], k-d-trees [3], and R-trees [10], which are used to optimize queries based on exact matches of single tuples. However, these techniques do not significantly improve pattern queries, which deal with partial matches of multi-tuple sequences. There are also proposals for set-based indexing [8] [14], which is used to improve subset searching (e.g. find all papers containing "data mining" and "data warehousing" in a keyword list). However, these methods work for retrieval of unordered sets of items only.

To illustrate the shortcomings of the existing indexing methods, let us consider applying B+-tree and set-based indexes to execute the query from Fig. 2:

1. Using a B+-tree index, tuples containing all items of each web access sequence are joined first (by the IP attribute), and then it is verified whether they contain the given items in the given order. This approach can be fairly ineffective since a web access sequence may span many disk blocks, which results in multiple scans of each block of the relation.







2. Using a set-based index, the sequence identifiers (IP attribute) of all sequences that contain the searched items in any order are found, and then the sequences are read from the relation (perhaps with the help of a B+-tree) to verify the ordering of their items. This approach gives much better results than a B+-tree index; however, significant overhead comes from reading and verifying the sequences having incorrect ordering.

In this paper we consider pattern queries on web access log databases. Such databases are characterized by a relatively small number of items (URLs), which occur frequently in various orders, and therefore a set-based index is not efficient. We present a new bitmap-oriented indexing method, which optimizes pattern queries. The basic idea behind our method, as compared to set-based indexes, is that the index structure includes not only the items of a sequence, but also the ordering of the items. In this way, we reduce the number of web access sequences needlessly read from the database, which results in shorter query execution time. We performed several experiments, which showed a significant improvement over existing indexing methods.

The structure of the paper is as follows. Section 2 describes the sequential 
index structure and algorithms to create and to use the index. In Sect. 3 we 
present the results of our performance experiments. Section 4 contains final 
conclusions. 



1.1 Basic Definitions and Problem Formulation 

Let L = {l_1, l_2, ..., l_k} be a set of literals called items (URLs). A web access sequence S = <X_1 X_2 ... X_n> is an ordered list of sets of items such that each set of items X_i ⊆ L. X_i is called a sequence element. All items in a sequence element are unordered. For short, we will also refer to a web access sequence as a sequence.

We say that a web access sequence <X_1 X_2 ... X_n> is contained in another web access sequence <Y_1 Y_2 ... Y_m> if there exist integers i_1 < i_2 < ... < i_n such that X_1 ⊆ Y_{i_1}, X_2 ⊆ Y_{i_2}, ..., X_n ⊆ Y_{i_n}.

Problem Formulation. Let D be a database of variable length web access sequences. Let S be a web access sequence. The problem of pattern queries consists in finding in D all web access sequences that contain the web access sequence S.
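The containment relation and the pattern query can be made concrete with a short Python sketch. It is a naive reference check, not the indexing method of this paper: sequences are modeled as lists of Python sets, the function names are hypothetical, and the database D is the three-sequence example relation used in the Introduction.

```python
def contains(sequence, pattern):
    """Return True if `pattern` is contained in `sequence`.

    Both arguments are lists of sets. Each pattern element is matched to the
    earliest remaining sequence element that is a superset of it, which is
    sufficient to decide whether increasing indices i_1 < ... < i_n exist.
    """
    position = 0
    for element in pattern:
        while position < len(sequence) and not element <= sequence[position]:
            position += 1
        if position == len(sequence):
            return False
        position += 1  # the next pattern element must match a later element
    return True


def pattern_query(database, pattern):
    """Return the identifiers of all sequences in `database` containing `pattern`."""
    return [sid for sid, seq in database.items() if contains(seq, pattern)]


D = {
    1: [{"A", "B"}, {"C"}, {"D"}],
    2: [{"A"}, {"E", "C"}, {"F"}],
    3: [{"B", "C", "D"}, {"A"}],
}
print(pattern_query(D, [{"A"}, {"E"}, {"F"}]))  # [2]
```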



1.2 Related Work 

Database indexes provided today by most database systems are B+-tree indexes, used to retrieve tuples of a relation with specified values involving one or more attributes [5]. Each non-leaf node contains entries of the form (v, p), where v is the separator value, which is derived from the keys of the tuples and is used to tell which sub-tree holds the searched key, and p is the pointer to its child node. Each leaf node contains entries of the form (k, p), where p is the pointer to the tuple corresponding to the key k.







A set-based bitmap indexing method, which is used to enable faster subset search in relational databases, was presented in [14] (a special case of superimposed coding). The key idea of the set-based bitmap index is to build binary keys, called group bitmap keys, associated with each item set. The group bitmap key represents the contents of the item set by setting bits to '1' on positions determined from item values (by means of a modulo function). An example set-based bitmap index for three item sets, {0, 7, 12, 13}, {2, 4}, and {10, 15, 17}, is given in Fig. 3. When a subset search query seeking item sets containing e.g. items 15 and 17 is issued, the group bitmap key for the searched subset is computed (see Fig. 4). Then, by means of a bit-wise AND, the index is scanned for keys containing 1's on the same positions. As the result of the first step of the subset search procedure, the item sets identified by set=1 and set=3 are returned. Then, in the verification step (required because of the ambiguity of the modulo function), these item sets are tested for the containment of the items 15 and 17. Finally, the item set identified by set=3 is the result of the subset search. Notice that this indexing method does not consider item ordering.
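The two-step subset search described above (bitmap filtering followed by verification) can be sketched in Python. The sketch is illustrative and is not taken from [14]: the function names are hypothetical, and the 5-bit key length is an assumption chosen to match the example.

```python
def group_bitmap_key(items, n_bits=5):
    """OR together the hash keys 2^(item mod n_bits) of all items."""
    key = 0
    for item in items:
        key |= 1 << (item % n_bits)
    return key


item_sets = {1: {0, 7, 12, 13}, 2: {2, 4}, 3: {10, 15, 17}}
index = {set_id: group_bitmap_key(items) for set_id, items in item_sets.items()}


def subset_search(query_items):
    """Return (candidates after bitmap filtering, verified results)."""
    query_key = group_bitmap_key(query_items)
    # Step 1: keep the sets whose key has 1's on all positions of the query key.
    candidates = [sid for sid, key in index.items() if key & query_key == query_key]
    # Step 2: verify real containment, since the modulo hash is ambiguous.
    results = [sid for sid in candidates if query_items <= item_sets[sid]]
    return candidates, results


candidates, results = subset_search({15, 17})
print([format(key, "05b") for key in index.values()])  # ['01101', '10100', '00101']
print(candidates, results)  # [1, 3] [3]
```

Set 1 passes the bitmap filter only because 15 and 17 hash to the same positions as 0 and 7 (or 12); the verification step removes it.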



relation        hash keys    group bitmap keys
set  item
1    0          00001
1    7          00100
1    12         00100
1    13         01000        01101
2    2          00100
2    4          10000        10100
3    10         00001
3    15         00001
3    17         00100        00101

set-based bitmap index
bitmap key   set
01101        1
10100        2
00101        3

Fig. 3. Set-based bitmap index



searched subset of items    hash keys    group bitmap key
15                          00001
17                          00100        00101

set-based bitmap index
bitmap key   set
01101        1
10100        2
00101        3

matching index entries
bitmap key   set
01101        1
00101        3

verify item sets: 1, 3

Fig. 4. Set retrieval using set-based bitmap index

In [8], a conceptual clustering method using an entropic criterion for conceptual clustering (EC3) is used to define indexing schemes on sets of binary features. Similar data item sets are stored in the same cluster, and a similarity measure based on entropy is used during retrieval to find a cluster containing the searched subset. The method does not consider item ordering.








2 Sequential Index Structure 

In this section we present our indexing method, called sequential indexing, for 
optimizing pattern queries. The sequential index structure consists of sequences 
of bitmaps generated for web access sequences. Each bitmap encodes all items 
(similarly to a set-based bitmap index) of a portion of a web access sequence as 
well as ordering relations between each two of the items. 

We start with the preliminaries, then we present the index construction algo- 
rithm and explain how to use the sequential index structure. Finally, we discuss 
index storage and maintenance problems. 

2.1 Preliminaries 

Web access sequences contain categorical items in the form of URLs. For the sake of convenience, we convert these items to integer values by means of an item mapping function.

Definition 1. An item mapping function fi(x), where x is a literal, is a function which transforms a literal into an integer value.

Example 1. Given a set of literals L = {A, B, C, D, E, F}, an item mapping function can take the following values: fi(A) = 1, fi(B) = 2, fi(C) = 3, fi(D) = 4, fi(E) = 5, fi(F) = 6.

Similarly, we use an order mapping function to express web access sequence ordering relations by means of integer values. Thus, we will be able to represent web access sequence items as well as web access sequence ordering uniformly.

Definition 2. An order mapping function fo(x, y), where x and y are literals and fo(x, y) ≠ fo(y, x), is a function which transforms a web access sequence <{x}{y}> into an integer value.

Example 2. For the set of literals used in the previous example, an order mapping function can be expressed as fo(x, y) = 6 * fi(x) + fi(y), e.g. fo(C, F) = 24.

Using the above definitions, we will be able to transform web access sequences 
into item sets, which are easier to manage, search and index. An item set repre- 
senting a web access sequence is called an equivalent set. 

Definition 3. An equivalent set E for a web access sequence S = <X_1 X_2 ... X_n> is defined as:

$$E = \Big( \bigcup_{x \in X_1 \cup X_2 \cup \ldots \cup X_n} \{ \mathit{fi}(x) \} \Big) \cup \Big( \bigcup_{x, y \in X_1 \cup X_2 \cup \ldots \cup X_n \,\wedge\, x \text{ precedes } y} \{ \mathit{fo}(x, y) \} \Big)$$

where fi() is an item mapping function and fo() is an order mapping function.







Example 3. For the web access sequence S = <{A, B}{C}{D}> and the presented item mapping function and order mapping function, the equivalent set E is evaluated as follows:

$$E = \Big( \bigcup_{x \in \{A,B,C,D\}} \{ \mathit{fi}(x) \} \Big) \cup \Big( \bigcup_{\langle\{x\}\{y\}\rangle \in \{ \langle\{A\}\{C\}\rangle, \langle\{B\}\{C\}\rangle, \langle\{A\}\{D\}\rangle, \langle\{B\}\{D\}\rangle, \langle\{C\}\{D\}\rangle \}} \{ \mathit{fo}(x, y) \} \Big)$$

= {fi(A)} ∪ {fi(B)} ∪ {fi(C)} ∪ {fi(D)} ∪ {fo(A, C)} ∪ {fo(B, C)} ∪ {fo(A, D)} ∪ {fo(B, D)} ∪ {fo(C, D)} = {1, 2, 3, 4, 9, 15, 10, 16, 22}
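The evaluation of an equivalent set can be reproduced with a few lines of Python. The sketch uses the item mapping of Example 1 and the order mapping of Example 2; the function names and the list-of-sets representation are assumptions, and "x precedes y" is read as "x occurs in an earlier sequence element than y".

```python
ITEM_CODES = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6}


def fi(x):
    """Item mapping function of Example 1."""
    return ITEM_CODES[x]


def fo(x, y):
    """Order mapping function of Example 2."""
    return 6 * fi(x) + fi(y)


def equivalent_set(sequence):
    """Equivalent set of a web access sequence given as a list of sets."""
    result = set()
    earlier_items = set()  # items of all elements already processed
    for element in sequence:
        for y in element:
            result.add(fi(y))
            for x in earlier_items:  # every such x precedes y
                result.add(fo(x, y))
        earlier_items |= element
    return result


E = equivalent_set([{"A", "B"}, {"C"}, {"D"}])
print(sorted(E))  # [1, 2, 3, 4, 9, 10, 15, 16, 22]
```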

Observation. For any two web access sequences S_1 and S_2, we have: if S_2 contains S_1, then E_1 ⊆ E_2, where E_1 is the equivalent set for S_1 and E_2 is the equivalent set for S_2. In general, the converse does not hold.

The size of the equivalent set depends on the number of items in the web access sequence and on the number of ordering relations between the items. For a given number of items in the web access sequence, the equivalent set will be the smallest if there are no ordering relations at all (i.e. S = <X>, then |E| = |X|, since E = X), and will be the largest if S is a sequence of one-item sets (i.e. S = <X_1 X_2 ... X_n>, for all i we have |X_i| = 1, then $|E| = n + \binom{n}{2}$).

Since the size of an equivalent set quickly increases as the number of the original sequence elements increases, we split web access sequences into partitions, which are small enough to process and encode.

Definition 4. We say that a web access sequence S = <X_1 X_2 ... X_n> is partitioned into web access sequences S_1 = <X_1 ... X_{a_1}>, S_2 = <X_{a_1+1} ... X_{a_2}>, ..., S_k = <X_{a_{k-1}+1} ... X_n> with level β if for each web access sequence S_i the size of its equivalent set |E_i| < β, and for all x, y ∈ X_1 ∪ X_2 ∪ ... ∪ X_n, where x precedes y, we have: either <{x}{y}> is contained in S_i, or {x} is contained in S_i and {y} is contained in S_j, where i < j (β should be greater than the maximal item set size).




Example 4. Partitioning the web access sequence S = <{A, B}{C}{D}{A, F}{B}{E}> with level 10 results in two web access sequences: S_1 = <{A, B}{C}{D}> and S_2 = <{A, F}{B}{E}>, since the sizes of the equivalent sets are, respectively, |E_1| = 9 (E_1 = {1, 2, 3, 4, 9, 15, 10, 16, 22}) and |E_2| = 9 (E_2 = {1, 6, 2, 5, 8, 38, 11, 41, 17}).
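One simple way to obtain such a partitioning is a greedy scan that closes the current partition as soon as adding the next sequence element would make the equivalent set reach the level β. The Python sketch below illustrates this idea; the greedy strategy and all names are assumptions made for the illustration, and the partitioning procedure actually used by the index construction algorithm may differ.

```python
ITEM_CODES = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6}


def equivalent_set(sequence):
    """Equivalent set with fi(x) = item code and fo(x, y) = 6 * fi(x) + fi(y)."""
    result, earlier_items = set(), set()
    for element in sequence:
        for y in element:
            result.add(ITEM_CODES[y])
            for x in earlier_items:
                result.add(6 * ITEM_CODES[x] + ITEM_CODES[y])
        earlier_items |= element
    return result


def partition(sequence, level):
    """Greedily split `sequence` so that every partition has |E| < level."""
    partitions, current = [], []
    for element in sequence:
        # Close the current partition if the extended one would be too large.
        if current and len(equivalent_set(current + [element])) >= level:
            partitions.append(current)
            current = []
        current.append(element)
    if current:
        partitions.append(current)
    return partitions


S = [{"A", "B"}, {"C"}, {"D"}, {"A", "F"}, {"B"}, {"E"}]
parts = partition(S, level=10)
print(parts)
print([len(equivalent_set(p)) for p in parts])  # [9, 9]
```

For the sequence of Example 4 the greedy scan happens to return the same two partitions, each with an equivalent set of size 9.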

Observation. For a web access sequence S partitioned into S_1, S_2, ..., S_k, and a web access sequence Q, we have: S contains Q if there exists a partitioning of Q into Q_1, Q_2, ..., Q_m such that Q_1 is contained in S_{i_1}, Q_2 is contained in S_{i_2}, ..., Q_m is contained in S_{i_m}, and i_1 < i_2 < ... < i_m.

Our sequential index structure will consist of equivalent sets stored for all web 
access sequences, optionally partitioned to reduce the complexity. To reduce 
storage requirements, equivalent sets will be stored in database in the form of 
bitmap signatures. 







Definition 5. The bitmap signature of a set X is an N-bit binary number created, by means of a bit-wise OR operation, from the hash keys of all data items contained in X. The hash key of the item x ∈ X is an N-bit binary number defined as follows: hash_key(x) = 2^(x mod N).



Example 5. For the set X = {0, 7, 12, 13} and N = 5, the hash keys of the set items are the following:

hash_key(0) = 2^(0 mod 5) = 1 = 00001,
hash_key(7) = 2^(7 mod 5) = 4 = 00100,
hash_key(12) = 2^(12 mod 5) = 4 = 00100,
hash_key(13) = 2^(13 mod 5) = 8 = 01000.

The bitmap signature of the set X is the bit-wise OR of all items' hash keys:
bitmap_signature(X) = 00001 OR 00100 OR 00100 OR 01000 = 01101.

Observation. For any two sets X and Y, if $X \subseteq Y$ then:

bitmap_signature(X) AND bitmap_signature(Y) = bitmap_signature(X),
where AND is the bit-wise AND operator. This property is not reversible in general (when we find that the above formula evaluates to TRUE, we still have to verify the result traditionally).
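The observation leads to a filter-then-verify pattern. The sketch below is illustrative only; `may_contain` and `contains` are hypothetical helper names, and the exact subset test stands in for the "traditional" verification mentioned in the text.

```python
def bitmap_signature(items, n_bits):
    """Bit-wise OR of 2^(x mod N) over all items."""
    signature = 0
    for x in items:
        signature |= 1 << (x % n_bits)
    return signature


def may_contain(candidate, query, n_bits):
    """Signature filter: False means the query is certainly not a subset."""
    q = bitmap_signature(query, n_bits)
    return q & bitmap_signature(candidate, n_bits) == q


def contains(candidate, query, n_bits):
    """Cheap filter first, exact subset test only for surviving candidates."""
    return may_contain(candidate, query, n_bits) and set(query) <= set(candidate)
```

With N = 5, the sets {7} and {12} share a signature, so `may_contain({12}, {7}, 5)` is a false hit that only the exact test in `contains` removes.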



In order to plan the length N of a bitmap signature for a given average set size, consider the following analysis. Assuming a uniform item distribution, the probability that the representation of the set X sets k bits to '1' in an N-bit bitmap signature is:

$$P = \frac{\binom{N}{k}\, f_{k,|X|}}{N^{|X|}} \qquad (2)$$

where $f_{0,|X|}$ …



The expected number of bits set to '1' for 16-bit bitmap signatures and various set sizes is illustrated in Fig. 5. We can observe that, e.g., for a set of 10 items, N should be greater than 8 (otherwise all bits are set to 1 and the signature is unusable, since it always matches).
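The expected number of set bits can be estimated numerically. The sketch below is an assumption-laden illustration: it takes $f_{k,n}$ to be the number of ways n items can occupy exactly k given bit positions (counted by inclusion-exclusion), which makes P a proper probability distribution under uniform hashing; the function names are hypothetical.

```python
from math import comb


def onto_count(n_items: int, k: int) -> int:
    """Assumed meaning of f: ways n_items items hit exactly k given bits."""
    return sum((-1) ** j * comb(k, j) * (k - j) ** n_items for j in range(k + 1))


def prob_k_bits(n_bits: int, n_items: int, k: int) -> float:
    """Probability that exactly k of the N bits are set."""
    return comb(n_bits, k) * onto_count(n_items, k) / n_bits ** n_items


def expected_bits_set(n_bits: int, n_items: int) -> float:
    """Mean number of '1' bits in the signature of an n_items-element set."""
    return sum(k * prob_k_bits(n_bits, n_items, k) for k in range(n_bits + 1))
```

Under the same uniformity assumption the mean also has the closed form $N(1 - (1 - 1/N)^{n})$, which gives a quick way to size N for a given average set size.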




Fig. 5. Number of bitmap signature bits set to '1' for various set sizes (N=16)

The probability that a bitmap signature of length N having k 1's matches another bitmap signature of length N having m 1's is $\binom{m}{k} / \binom{N}{k}$. It means







that the smaller k, the better the pruning performed when bitmap signatures of item sets are matched to check their containment (so fewer item sets have to be verified).

2.2 Sequential Index Construction Algorithm 

The sequential index construction algorithm iteratively processes all web access sequences in the database. First, the web access sequences are partitioned with the given level β. Then, for each partition of each web access sequence, the equivalent set is evaluated. In the next step, for each equivalent set, its N-bit bitmap signature is generated and stored in the database. The formal description of the algorithm is given below.

Input: database D of web access sequences, partitioning level β, bitmap length N
Output: sequential index for D

Method:

for each web access sequence S ∈ D do begin
    partition S into partitions S_1, S_2, ..., S_k with level β;
    for each partition S_i do begin
        evaluate equivalent set E_i for S_i;
        bitmap_i = bitmap_signature(E_i);
        store bitmap_i in the database;
    end;
end.

Consider the following example of sequential index construction. Assume that β = 10, N = 16, and the database D contains three web access sequences: $S_1$ = <{A,B}{C}{D}{A,F}{B}{E}>, $S_2$ = <{A}{C,E}{F}{B}{E}{A,D}>, $S_3$ = <{B,C,D}{A}>.

First, we partition the web access sequences with β = 10. Notice that $S_3$ is, in fact, not partitioned since its equivalent set is small enough. The symbol $S_{i,j}$ denotes the j-th partition of the i-th web access sequence.



$S_{1,1}$ = <{A,B}{C}{D}> (ordering relations are: A→C, B→C, A→D, B→D, C→D)

$S_{1,2}$ = <{A,F}{B}{E}> (ordering relations are: A→B, F→B, A→E, F→E, B→E)

$S_{2,1}$ = <{A}{C,E}{F}> (ordering relations are: A→E, A→C, E→F, C→F)

$S_{2,2}$ = <{B}{E}{A,D}> (ordering relations are: B→E, B→A, B→D, E→A, E→D)

$S_{3,1}$ = <{B,C,D}{A}> (ordering relations are: B→A, C→A, D→A)



Then we evaluate the equivalent sets for the partitioned web access sequences. We use the example item mapping function and order mapping function taken from Definitions 1 and 2. The symbol $E_{i,j}$ denotes the equivalent set for $S_{i,j}$.

$E_{1,1}$ = {1, 2, 3, 4, 9, 15, 10, 16, 22}

$E_{1,2}$ = {1, 6, 2, 5, 8, 38, 11, 41, 17}

$E_{2,1}$ = {1, 3, 5, 6, 11, 9, 36, 24}

$E_{2,2}$ = {2, 5, 1, 4, 17, 13, 16, 31, 36}

$E_{3,1}$ = {2, 3, 4, 1, 13, 19, 25}

In the next step, we generate 16-bit bitmap signatures for all equivalent sets.

bitmap_signature($E_{1,1}$) = 1000011001011111
bitmap_signature($E_{1,2}$) = 0000101101100110
bitmap_signature($E_{2,1}$) = 0000101101111010
bitmap_signature($E_{2,2}$) = 1010000000110111
bitmap_signature($E_{3,1}$) = 0010001000011110



Finally, the sequential index is stored in the database in the following form:

SID   bitmap_signature
1     1000011001011111, 0000101101100110
2     0000101101111010, 1010000000110111
3     0010001000011110
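The construction steps can be sketched in Python. Several parts are assumptions rather than the authors' definitions: the item codes A..F → 1..6 and the order code 6·code(x) + code(y) are inferred from the equivalent sets listed in the example (Definitions 1 and 2 are not repeated in this section), the partitioning is a simple greedy scan that closes a partition as soon as the next item set would push the equivalent set size to β or more, and every item that precedes another in an earlier item set is treated as an ordering relation.

```python
ITEM_CODE = {name: code for code, name in enumerate("ABCDEF", start=1)}  # assumed mapping


def equivalent_set(sequence):
    """Item codes plus one code per ordering relation x -> y (assumed encoding)."""
    codes = set()
    for pos, item_set in enumerate(sequence):
        for y in item_set:
            codes.add(ITEM_CODE[y])
            for earlier in sequence[:pos]:
                for x in earlier:
                    codes.add(6 * ITEM_CODE[x] + ITEM_CODE[y])
    return codes


def partition(sequence, beta):
    """Greedy partitioning: keep each partition's equivalent set below beta."""
    parts, current = [], []
    for item_set in sequence:
        if current and len(equivalent_set(current + [item_set])) >= beta:
            parts.append(current)
            current = []
        current.append(item_set)
    if current:
        parts.append(current)
    return parts


def bitmap_signature(values, n_bits):
    signature = 0
    for v in values:
        signature |= 1 << (v % n_bits)
    return signature


def build_index(database, beta, n_bits):
    """Map each sequence identifier to the signatures of its partitions."""
    return {
        sid: [bitmap_signature(equivalent_set(p), n_bits) for p in partition(seq, beta)]
        for sid, seq in database.items()
    }


db = {
    1: [{"A", "B"}, {"C"}, {"D"}, {"A", "F"}, {"B"}, {"E"}],
    3: [{"B", "C", "D"}, {"A"}],
}
index = build_index(db, beta=10, n_bits=16)
```

For sequences 1 and 3 of the example, this sketch yields the same signatures as the table above.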



2.3 Using Sequential Index for Pattern Queries 

During pattern query execution, the bitmap signatures for all web access sequences are scanned. For each web access sequence, the test of a searched subsequence mapping is performed. If the searched subsequence can be successfully mapped to the web access sequence partitions, then the web access sequence is read from the database. Due to the ambiguity of the bitmap signature representation, additional verification of the retrieved web access sequence is required. The verification can be performed using the traditional B+-tree method, since it consists in reading the web access sequence from the database and checking whether it contains the searched subsequence. The formal description of the algorithm is given below. We use a simplified notation Q[i_start..i_end] to denote a partition <$X_{i\_start} X_{i\_start+1} \ldots X_{i\_end}$> of a sequence Q = <$X_1 X_2 \ldots X_n$>, where 1 ≤ i_start ≤ i_end ≤ n. The symbol & denotes the bit-wise AND operation.

Input: sequential index, searched subsequence Q
Output: identifiers of web access sequences to be verified

Method:

for each sequence identifier sid do begin
    j = 1;
    i_end = 1;
    repeat
        i_start = i_end;
        evaluate equivalent set E_Q for Q[i_start..i_end];
        mask = bitmap_signature(E_Q);
        while mask & bitmap_signature(E_{sid,j}) <> mask
              and j ≤ number of partitions for sid do j++;
        if j ≤ number of partitions for sid then repeat
            i_end++;
            generate equivalent set E_Q for Q[i_start..i_end];
            mask = bitmap_signature(E_Q);
        until mask & bitmap_signature(E_{sid,j}) <> mask
              or i_end = size of Q;
    until i_start = i_end or j > number of partitions for sid;
    if j ≤ number of partitions then return(sid);
end.

Consider the following example of using the sequential index to perform pattern queries. Assume that we look for all web access sequences which contain the subsequence <{F}{B}{D}>. Bit patterns in this example are written with the least significant bit first, i.e., in reverse order with respect to Sect. 2.2. We begin with sid=1. We find that <{F}> (0000001000000000) matches the first partition (1111101001100001). So, we check whether <{F}{B}> (0010001000000000) also matches this partition. Accidentally it does, but when we try <{F}{B}{D}> (1010101010000000), we find that it does not match the first partition. Then we move to the second partition to check whether <{D}> (0000100000000000) matches the partition (0110011011010000). This test fails and since we have no more partitions, we reject sid=1 (this web access sequence does not contain the given subsequence).

In the next step, we check sid= 2. We find that < {F} > (0000001000000000) 
matches the first partition (0101111011010000). So, we check whether < {F}, 
{B} > (0010001000000000) also matches this partition. It does not, so we move 
to the second partition and find that < {B} > (0010000000000000) matches 
the partition (1110110000000101). Then we must check whether < {B},{D} > 
(1010100000000000) also matches the partition. This time the check is positive 
and since we have matched the whole subsequence, we return sid= 2 as a part of 
the result. The web access sequence will be verified later. 

Finally, we check sid= 3. We find that < {F} > (0000001000000000) does 
not match the first partition (0111100001000100). Since we have no more par- 
titions, we reject sid= 3 (this web access sequence does not contain the given 
subsequence) . 

So far, the result of our index scanning is the web access sequence identified by sid=2. We still need to read the sequence and verify whether it really contains the searched subsequence. In our example it does, so the result is returned to the user.
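The walk-through above can be replayed with a short Python sketch. It is not a transcription of the pseudocode: it uses a simplified greedy scan that, for each partition in turn, extends the current chunk of the query for as long as its signature is covered by the partition signature. The item and order codes (A..F → 1..6, relation x→y → 6·code(x) + code(y)) are inferred from the example in Sect. 2.2, and `candidate_sids` is a hypothetical name.

```python
ITEM_CODE = {name: code for code, name in enumerate("ABCDEF", start=1)}  # assumed mapping


def equivalent_set(sequence):
    """Item codes plus one code per ordering relation (assumed encoding)."""
    codes = set()
    for pos, item_set in enumerate(sequence):
        for y in item_set:
            codes.add(ITEM_CODE[y])
            for earlier in sequence[:pos]:
                for x in earlier:
                    codes.add(6 * ITEM_CODE[x] + ITEM_CODE[y])
    return codes


def bitmap_signature(values, n_bits):
    signature = 0
    for v in values:
        signature |= 1 << (v % n_bits)
    return signature


def candidate_sids(index, query, n_bits):
    """Return identifiers whose partition signatures can cover the query in order."""
    hits = []
    for sid, partition_sigs in index.items():
        start = 0  # first query item set not yet mapped to a partition
        for sig in partition_sigs:
            end = start
            while end < len(query):
                mask = bitmap_signature(equivalent_set(query[start:end + 1]), n_bits)
                if mask & sig != mask:
                    break
                end += 1
            start = end
            if start == len(query):
                break
        if start == len(query):
            hits.append(sid)  # still has to be verified against the real sequence
    return hits


INDEX = {
    1: [0b1000011001011111, 0b0000101101100110],
    2: [0b0000101101111010, 0b1010000000110111],
    3: [0b0010001000011110],
}
print(candidate_sids(INDEX, [{"F"}, {"B"}, {"D"}], 16))  # [2]
```

Because a shorter chunk always has a subset of the bits of a longer one, taking the longest matching chunk per partition cannot cause a false dismissal; false hits remain possible and are removed by the final verification.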

2.4 Physical Storage 

Since a sequential index is fully scanned each time a pattern query is performed, it is critical to store it efficiently. We store index entries in the form <p, n, $bitmap_1$, $bitmap_2$, ..., $bitmap_n$>, where p is a pointer to the web access sequence described by the index entry, n is the number of bitmap signatures, and $bitmap_i$ is a single bitmap signature for the web access sequence. The pointer p should address the translation table, which contains pointers to the physical tuples of the relation holding the web access sequences (the structure is <n, $p_1$, $p_2$, ..., $p_n$>). Since we usually have a B+-tree index on the sequence identifier attribute (to optimize joins), we can use its leaves as a translation table instead of consuming database space with redundant structures. An example storage implementation for the sequential index from Sect. 2.2 is given in Fig. 6.
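A compact binary layout for an index entry <p, n, bitmap_1, ..., bitmap_n> might look as follows. The field widths (32-bit pointer, 8-bit count, 16-bit signatures, little-endian) and the function names are assumptions chosen for illustration; the text does not prescribe them.

```python
import struct


def pack_entry(pointer: int, bitmaps: list) -> bytes:
    """Serialize <p, n, bitmap_1..bitmap_n> without padding."""
    return struct.pack(f"<IB{len(bitmaps)}H", pointer, len(bitmaps), *bitmaps)


def unpack_entry(buffer: bytes, offset: int = 0):
    """Read one entry; return (pointer, bitmaps, offset of the next entry)."""
    pointer, n = struct.unpack_from("<IB", buffer, offset)
    bitmaps = list(struct.unpack_from(f"<{n}H", buffer, offset + 5))
    return pointer, bitmaps, offset + 5 + 2 * n


entry = pack_entry(7, [0b1000011001011111, 0b0000101101100110])
```

Storing the count n in front of the signatures lets a full scan step from entry to entry without any separator bytes.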








Fig. 6. Example physical storage structure for sequential index 



2.5 Update Operations 

Maintenance of a sequential index is quite expensive, since bitmap signatures are not reversible, and updates may influence the partitioning of web access sequences. For example, when we insert a new tuple into the database, thus extending a web access sequence, we cannot determine which partition the tuple should belong to. Similarly, when we delete a tuple, we cannot determine the corresponding partition, and, even if we could, we do not know whether the item being deleted was the only item mapped to a given bit of the bitmap signature (so that we could reset the bit).

In order to have a consistent state of a sequential index, we must perform the 
complete index creation procedure (partitioning, evaluating equivalent sets, gen- 
erating bitmap signatures) for the web access sequence being modified. However, 
since this solution might reduce DBMS performance for transaction-intensive 
databases, we propose the following algorithm of offline maintenance for se- 
quential indexes: 

1. Whenever a new item is added to an existing web access sequence, we set to 
’1’ all bits in the first bitmap signature for the web access sequence. It means 
that any subsequence will match the first bitmap signature, and therefore we 
will not miss the right one. Any false hits will be eliminated during actual 
verification of subsequence containment. 

2. Whenever an item is removed from an existing web access sequence, we do 
not perform any modifications on the bitmap signatures of the web access 
sequence. We may get false hits, but they will be eliminated during final 
verification. 

Notice that using the above algorithm, the overall index performance may decrease temporarily, but we will not get incorrect query results. Over a period of time, the index should be rebuilt either completely, or for updated web access sequences only, e.g. according to a transaction log.
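The two maintenance rules can be sketched as follows, assuming the index is held as a dictionary from sequence identifier to a list of integer signatures. All three function names are hypothetical; `sids_to_rebuild` is an extra helper, not part of the described algorithm, that uses a saturated first signature as a marker for sequences awaiting a rebuild.

```python
def mark_item_added(index, sid, n_bits):
    """Rule 1: saturate the first signature so that no query is pruned by it."""
    index[sid][0] = (1 << n_bits) - 1


def mark_item_removed(index, sid):
    """Rule 2: leave the signatures untouched; stale bits only cause false hits."""
    return None


def sids_to_rebuild(index, n_bits):
    """Hypothetical helper: sequences whose first signature is saturated."""
    full = (1 << n_bits) - 1
    return [sid for sid, sigs in index.items() if sigs[0] == full]


index = {1: [0b1000011001011111, 0b0000101101100110], 3: [0b0010001000011110]}
mark_item_added(index, 3, 16)
```

After the call, every query mask matches the first partition of sequence 3, so the sequence is always passed on to verification until its signatures are rebuilt.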








3 Experimental Results 



We have performed several experiments on synthetic data sets to evaluate our 
sequential indexing method. The database of web access sequences was generated 
randomly, with uniform item distribution, and stored by Oracle8 DBMS. We 
used dense data sets, i.e. the number of available items was relatively small, 
and therefore each item occurred in a large number of web access sequences. 
The web access sequences contained 1-item sets only (pessimistic approach - 
maximal number of ordering relations). 

Figure 7a shows the number of disk blocks (including index scanning and relation access) which were read in order to retrieve web access sequences containing subsequences of various lengths. The data set contained 50000 web access sequences, each having on average 20 items out of 50. The compared database access methods were: a traditional SQL query using a B+-tree index on the IP attribute (B+-tree), a 24-bit set-based bitmap index (24S), a 32-bit sequential index with β = 28 built on top of a 24-bit set-based bitmap index (24S32Q28), and a 48-bit sequential index with β = 55 built on top of a 24-bit set-based bitmap index (24S48Q55). Our sequential index achieved a significant improvement for searched subsequences of length greater than 4, e.g. for the subsequence length of 5 we were over 20 times faster than the B+-tree method and 8 times faster than the set-based bitmap index.

We also analyzed the influence of the partitioning level β on the sequential index performance. Figure 7b illustrates the filtering factor (percentage of web access sequences matched) for three sequential indexes built on bitmap signatures with a total size of 48 bits, but with different partitioning. We noticed that partitioning web access sequences into a large number of partitions (small β) results in a performance increase for long subsequences, but worsens the performance for short subsequences. Using a small number of web access sequence partitions (high β) results in more "stable" performance, but the performance is worse for long subsequences.



Fig. 7. Experimental results (panels a and b; horizontal axis of panel a: subsequence size)







4 Final Conclusions 

Pattern queries on web access logs are specific in the sense that they require complicated SQL queries and database access methods (multiple joins, inefficient optimization). In this paper we have presented a new indexing method, called sequential indexing, which can replace B+-tree indexing and set-based indexing. During experiments, we have found that the most efficient solution is to combine a set-based index (which checks the items of a web access sequence) with a sequential index (which checks the ordering of the items), which dramatically outperforms B+-tree access methods.

References 

1. Agrawal R., Srikant R.: Mining Sequential Patterns. Proc. of the 11th ICDE Conf. 
(1995) 

2. Bayardo R.J.: Efficiently Mining Long Patterns from Databases. Proc. of the ACM 
SIGMOD International Conf. on Management of Data (1998) 

3. Bentley J.L.: Multidimensional binary search trees used for associative searching. Comm. of the ACM 18 (1975)

4. Catledge L.D., Pitkow J.E.: Characterizing Browsing Strategies in the World Wide 
Web. Proc. of the 3rd Int’l WWW Conference (1995) 

5. Comer D.: The Ubiquitous B-tree. Comput. Surv. 11 (1979) 

6. Cooley R., Mobasher B., Srivastava J.: Data preparation for mining World Wide 
Web browsing patterns. Journal of Knowledge and Information Systems 1 (1999) 

7. Cooley R., Mobasher B., Srivastava J.: Grouping Web Page References into Trans- 
actions for Mining World Wide Web Browsing Patterns. Proc. of the 1997 IEEE 
Knowledge and Data Engineering Exchange Workshop (1997) 

8. Diamantini C., Panti M.: A Conceptual Indexing Method for Content-Based Re- 
trieval. Proc. of the 15th IEEE Int’l Conf. on Data Engineering (1999) 

9. Guralnik V., Wijesekera D., Srivastava J.: Pattern Directed Mining of Sequence 
Data. Proc. of the 4th KDD Conference (1998) 

10. Guttman A.: R-trees: A dynamic index structure for spatial searching. Proc. of 
ACM SIGMOD International Conf. on Management of Data (1984) 

11. Luotonen A.: The common log file format. http://www.w3.org/pub/WWW/ (1995) 

12. Mannila H., Toivonen H.: Discovering generalized episodes using minimal occur- 
rences. Proc. of the 2nd KDD Conference (1996) 

13. Mannila H., Toivonen H., Verkamo A.I.: Discovering frequent episodes in sequences. 
Proc. of the 1st KDD Conference (1995) 

14. Morzy T., Zakrzewicz M.: Group Bitmap Index: A Structure for Association Rules 
Retrieval. Proc. of the 4th KDD Conference (1998) 

15. O’Neil P.: Model 204 Architecture and Performance. Proc. of the 2nd International 
Workshop on High Performance Transactions Systems (1987) 

16. Perkowitz M., Etzioni O.: Adaptive Web Sites: an AI challenge. Proc. of the 15th 
Int. Joint Conf. AI (1997) 

17. Pirolli P., Pitkow J., Rao R.: Silk From a Sow’s Ear: Extracting Usable Structure 
from the World Wide Web. Proc. of Conf. on Human Factors in Computing Systems 
(1996) 

18. Pitkow J.: In search of reliable usage data on the WWW. Proc. of the 6th Int’l WWW Conference (1997)







19. Srikant R., Agrawal R.: Mining Sequential Patterns: Generalizations and Perfor- 
mance Improvements. Proc. of the 5th EDBT Conference (1996) 

20. Yan T.W., Jacobsen M., Garcia-Molina H., Dayal U.: From User Access Patterns to Dynamic Hypertext Linking. Proc. of the 5th Int’l WWW Conference (1996)




Ensemble Feature Selection Based on Contextual Merit 
and Correlation Heuristics 1 



Seppo Puuronen, Iryna Skrypnyk, and Alexey Tsymbal 



Department of Computer Science and Information Systems, University of Jyväskylä
P.O. Box 35, FIN-40351 Jyväskylä, Finland
{sepi, iryna, alexey}@cs.jyu.fi



Abstract. Recent research has proven the benefits of using ensembles of classifiers for classification problems. Ensembles of diverse and accurate base classifiers are constructed by machine learning methods manipulating the training sets. One way to manipulate the training set is to use feature selection heuristics for generating the base classifiers. In this paper we examine two of them: correlation-based and contextual merit-based heuristics. Both rely on quite similar assumptions concerning heterogeneous classification problems. Experiments are conducted on several data sets from the UCI Repository. We construct a fixed number of base classifiers over selected feature subsets and refine the ensemble iteratively, promoting diversity of the base classifiers and relying on global accuracy growth. According to the experimental results, the contextual merit-based ensemble outperforms the correlation-based ensemble as well as C4.5. The correlation-based ensemble produces more diverse and simpler base classifiers, and the iterations promoting diversity have a less evident effect than for the contextual merit-based ensemble.



1 Introduction 

Machine learning research has progressed in many directions. One of those directions 
still containing a number of open questions is the construction and use of an ensemble 
of classifiers. Ensembles are well established as a method for obtaining highly accu- 
rate combined classifiers by integrating less accurate base classifiers. 

Many methods for constructing ensembles have been developed. They can be di- 
vided into two main types: general methods and methods specific to a particular 
learning algorithm. Amongst successful general ones are sampling methods, and 
methods manipulating either the input features or the output targets. We apply the 
second type of method when ensemble integration is made using weighted voting. 

The goal of traditional feature selection is to find and remove features that are unhelpful or destructive to learning in order to construct a single classifier [4]. Since an



1 This research is partially supported by the Academy of Finland (project #52599), the Centre 
for International Mobility (CIMO) and the COMAS Graduate School of the University of 
Jyvaskyla. 

A. Caplinskas and J. Eder (Eds.): ADBIS 2001, LNCS 2151, pp. 155-168, 2001. 

© Springer-Verlag Berlin Heidelberg 2001 







ensemble of classifiers represents multiple inference models, it can produce different 
inference models in different sub areas of the instance space. Feature selection heuris- 
tics are used to assist in guiding the process of constructing base classifiers imple- 
menting those models. In addition to decreasing the number of features, ensemble 
feature selection has an additional goal of finding multiple feature subsets to produce 
a set of base classifiers promoting disagreement among them [15]. 

This paper continues our research in ensemble feature selection [18,20]. We have 
developed an algorithm with an iterative refinement cycle. This cycle provides feed- 
back to ensemble construction in order to improve ensemble characteristics with 
selected feature selection heuristics. In this paper we examine two feature selection 
heuristics for ensemble creation applying also the refinement cycle. 

For our study we chose two heuristics for feature selection that have an ability to treat complex interrelations between features, namely the contextual merit-based heuristic and the correlation-based heuristic [1,7,9,16]. The interrelations between features that are assumed in those heuristics might be a consequence of the fact that some domains contain features varying in importance across the instance space [1,7,10,12,16]. This situation, called feature-space heterogeneity, is widespread in real data sets. Heterogeneity in data becomes critical, because most classification algorithms fail to make accurate predictions.

We compare the correlation-based and contextual merit-based heuristics on several data sets from the UCI repository and draw some conclusions about the use of those heuristics for ensemble feature selection. In particular, we observe the co-ordination of the feature selection heuristics and the internal design of the refinement cycle.

In Chapter 2 we consider the basic framework of ensemble classification. Chapter 3 describes the Contextual Merit measure (CM measure) and the correlation-based merit measure. Chapter 4 presents a brief description of the algorithm for ensemble feature selection with the iterative refinement cycle. In Chapter 5 our experimental study on several data sets is described. We summarize with conclusions in Chapter 6.



2 Ensemble Classification 

In supervised learning, a learning algorithm is given training instances of the form $\{(x_1, y_1), \ldots, (x_M, y_M)\}$ for some unknown function $y = g(x)$, where the $x_i$ values are vectors of the form $(x_{i1}, \ldots, x_{ij}, \ldots, x_{iN})$, where $x_{ij}$ are the feature values of $x_i$, and M is the size of the training set T. In this paper we will refer to features as $f_j$, $j = 1, \ldots, N$, where N is the number of features. Given a set of training instances T, a learning algorithm outputs a classifier h, which is a hypothesis about the true function g. Given new x values, it predicts the corresponding y values. In ensemble classification (Fig. 1) a set of base classifiers $h_1, \ldots, h_S$, where S is the size of the ensemble, is formed during the learning phase.

Each base classifier in the ensemble (classifiers $h_1, \ldots, h_S$ in this case) is trained using the training instances of the corresponding training set $T_i$, $i = 1, \ldots, S$. In this paper we form the training sets $T_1, \ldots, T_S$ as the subsets of features using feature selection heuristics. For the ensemble classification the classifications of the base classifiers are combined during the application phase in some way $h^* = F(h_1, h_2, \ldots, h_S)$ to produce the final classification of the ensemble. In this paper the final classification $y^*$ is formed using weighted voting [2, 3, 5, 8].
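As an illustration of the combination step, weighted voting can be sketched as below. The function name and the choice of weights are assumptions; the text does not state how the weights are obtained (a validation accuracy per base classifier would be one plausible option).

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Return the class label with the largest total weight.

    predictions[i] is the label predicted by base classifier h_i for one
    instance, weights[i] is the (assumed) weight of that classifier.
    """
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)
```

With weights 0.9, 0.3 and 0.4, a single strong classifier voting "a" outweighs two weaker ones voting "b".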




Fig. 1. Ensemble classification process

Research has shown that an effective ensemble should consist of a set of base classifiers that not only have high accuracy, but also make their errors on different parts of the input space [5,14].

Combining the classifications of several classifiers is useful only if there is disagreement among the base classifiers, i.e. they are independent in the production of their errors. The error rate of each base classifier should not exceed a certain limit. Otherwise, the ensemble error rate will usually increase as a result of the combination of their classifications [5]. The measure of independence of the base classifiers is called the diversity of an ensemble. Several ways to calculate the diversity are considered in [6,15,17].



3 Heuristics for Ensemble Feature Selection 

In this chapter we consider two heuristics for feature selection, namely the contextual merit measure and the correlation-based merit measure, which are intended to treat feature-space heterogeneity in classification problems [1]. Both of them are independent of the learning algorithm.



3.1 Contextual Merit Measure 

The main assumption of the CM measure, which was developed in [9], is that features important for classification should differ significantly in their values to predict instances from different classes. The CM measure is robust to both problems of class heterogeneity and feature-space heterogeneity. The CM measure assigns a merit value to a feature taking into account the degree to which the other features are capable of discriminating between the same instances as the given feature. In an extreme situation, when two instances of different classes differ in only one feature, then that feature is particularly valuable for classification and it is assigned additional merit.

We use the CM measure as it has been presented in [9] and described below. Let the difference $d^{(f_j)}_{x_i,x_k}$ between the values of a categorical feature $f_j$ for the vectors $x_i$ and $x_k$ be

$$d^{(f_j)}_{x_i,x_k} = \begin{cases} 0, & \text{if the values are the same} \\ 1, & \text{otherwise} \end{cases} \qquad (1)$$

and between the values of a numeric feature $f_j$ correspondingly be

$$d^{(f_j)}_{x_i,x_k} = \min\left(\frac{|x_{ij} - x_{kj}|}{t_{f_j}},\ 1\right) \qquad (2)$$

where $x_{ij}$ is the value of the feature $f_j$ in the vector $x_i$, and $t_{f_j}$ is a threshold. In this paper it is selected to be one-half of the range of the values of the feature $f_j$.

Then the distance $D_{x_i,x_k}$ between the vectors $x_i$ and $x_k$ is

$$D_{x_i,x_k} = \sum_{j=1}^{N} d^{(f_j)}_{x_i,x_k} \qquad (3)$$

where N is the number of features. The value of the CM measure $CM_{f_j}$ of a feature $f_j$ is

$$CM_{f_j} = \sum_{i=1}^{M} \sum_{x_k \in C(x_i)} w^{(f_j)}_{x_i,x_k}\, d^{(f_j)}_{x_i,x_k} \qquad (4)$$

where M is the number of instances, $C(x_i)$ is the set of vectors not from the same class as the vector $x_i$, and $w^{(f_j)}_{x_i,x_k}$ is a weight chosen so that instances that are close to each other, i.e., that differ only in a few of their features, have greater influence in determining each feature's CM measure value. In [9] the weights $w^{(f_j)}_{x_i,x_k} = 1 / D^2_{x_i,x_k}$ were used when $x_k$ is one of the K nearest neighbors of $x_i$, in terms of $D_{x_i,x_k}$, in the set $C(x_i)$, and $w^{(f_j)}_{x_i,x_k} = 0$ otherwise. The number of nearest neighbors K used in [9] was the binary logarithm of the number of examples in the set $C(x_i)$.
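The definitions (1)-(4) translate into a short Python sketch. Details that the text leaves open are assumptions here: K is the binary logarithm rounded to the nearest integer (at least 1), pairs with zero distance are skipped to avoid division by zero, and a threshold of `None` marks a categorical feature. The function names are hypothetical.

```python
from math import log2


def feature_diff(a, b, threshold=None):
    """Per-feature difference: formula (1) if threshold is None, else formula (2)."""
    if threshold is None:
        return 0.0 if a == b else 1.0
    return min(abs(a - b) / threshold, 1.0)


def contextual_merit(X, y, thresholds):
    """CM value of every feature, following formulas (3) and (4)."""
    n_features = len(X[0])
    cm = [0.0] * n_features
    for i, x_i in enumerate(X):
        counter_class = []  # (distance D, per-feature differences d) for C(x_i)
        for k, x_k in enumerate(X):
            if y[k] != y[i]:
                d = [feature_diff(x_i[j], x_k[j], thresholds[j]) for j in range(n_features)]
                counter_class.append((sum(d), d))
        if not counter_class:
            continue
        n_neighbors = max(1, round(log2(len(counter_class))))  # assumed rounding
        counter_class.sort(key=lambda pair: pair[0])
        for distance, d in counter_class[:n_neighbors]:
            if distance == 0:
                continue  # identical vectors in different classes (assumed handling)
            weight = 1.0 / distance ** 2
            for j in range(n_features):
                cm[j] += weight * d[j]
    return cm


X = [[0, 5], [0, 9], [1, 5], [1, 9]]
y = [0, 0, 1, 1]
merits = contextual_merit(X, y, thresholds=[None, None])
```

In this toy data set only the first feature separates the classes, so it collects all the merit while the second feature gets none.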



3.2 Correlation-Based Merit Measure 

The correlation-based approach is another widespread approach to estimating interrelations between features, or between features and the class variable. It uses Pearson's correlation coefficient as a measure of linear dependence between two variables. This kind of dependence is quite common in real world situations.







Many researchers have used the correlation-based approach to estimate the goodness of feature subsets. For example, [16,21] used the correlation between a particular feature and a particular class, and the features with the highest correlation were selected to construct a classifier for that class separately, producing different models for different classes. We will estimate the goodness of a feature subset as it was done in [7]. The correlation-based merit measure will be calculated not for a particular feature, but for a particular feature subset. In that way we will be able to consider interrelations between features in the subset, too. The basic assumption for this heuristic is as follows. Good feature subsets contain features highly correlated with the class, yet uncorrelated with each other. The correlation coefficient $r_{f_j f_k}$ between two numerical features $f_j$ and $f_k$ is calculated using the formula (5).

$$r_{f_j f_k} = \frac{S_{f_j f_k}}{S_{f_j} S_{f_k}} \qquad (5)$$

where $S_{f_j f_k}$ is the sample covariance between the features $f_j$ and $f_k$, and $S_{f_j}$ and $S_{f_k}$ are the sample standard deviations of the features $f_j$ and $f_k$ correspondingly.

According to [7] we calculate the merit $M_F$ of a feature subset with the formula (6).

$$M_F = \frac{n\, \bar{r}_{fy}}{\sqrt{n + n(n-1)\, \bar{r}_{ff}}} \qquad (6)$$

where F is a feature subset containing n features, y is a class variable, $\bar{r}_{fy}$ is the mean feature-class correlation ($f \in F$), and $\bar{r}_{ff}$ is the average feature-feature inter-correlation. The numerator in formula (6) gives an indication of how predictive of the class the subset of features is, and the denominator expresses how much redundancy exists among the features. Formula (5) is suitable only for the case of numeric variables. For categorical values binarization is used.
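For numeric features, the subset merit can be sketched as follows. This is an illustration under assumptions: correlations enter the averages by absolute value, a constant feature is given correlation 0, and the function names are hypothetical; categorical features would first need the binarization described in the text.

```python
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equally long numeric lists (0.0 if one is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def subset_merit(features, target):
    """Merit of a feature subset: n * mean r_fy / sqrt(n + n(n-1) * mean r_ff)."""
    n = len(features)
    r_fy = sum(abs(pearson(f, target)) for f in features) / n
    pairs = list(combinations(features, 2))
    r_ff = sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs) if pairs else 0.0
    return n * r_fy / sqrt(n + n * (n - 1) * r_ff)


y = [1, 2, 3, 4]
f1 = [2, 4, 6, 8]    # perfectly correlated with y
f2 = [1, -1, -1, 1]  # uncorrelated with y and with f1
```

Adding a redundant copy of `f1` leaves the merit unchanged, while adding the irrelevant `f2` lowers it, which is the behaviour the heuristic relies on.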

Let $f_j$ be a categorical feature having t values $v_1, \ldots, v_t$. We form t binary attributes $f_{ji}$, $i = 1, \ldots, t$, so that $f_{ji} = 1$ when $f_j = v_i$, and $f_{ji} = 0$ otherwise. The correlation between the categorical feature $f_j$ and the numeric feature $f_k$ is then calculated using the formula (7).

$$r_{f_j f_k} = \sum_{i=1}^{t} p(f_j = v_i)\, r_{f_{ji} f_k} \qquad (7)$$

Let $f_j$ be a categorical feature having t values $v_1, \ldots, v_t$ and $f_k$ be a categorical feature having l values $u_1, \ldots, u_l$. The binary attributes $f_{ji}$, $i = 1, \ldots, t$, and $f_{kq}$, $q = 1, \ldots, l$, are formed as described above. The correlation between these two categorical features is then calculated using the formula (8).

$$r_{f_j f_k} = \sum_{i=1}^{t} \sum_{q=1}^{l} p(f_j = v_i, f_k = u_q)\, r_{f_{ji} f_{kq}} \qquad (8)$$

In addition, the two formulae above are robust to missing values [7].



4 An Algorithm for Ensemble Feature Selection 



In this chapter we present an overview of our algorithm EFS_ref for ensemble feature selection with the iterative refinement cycle. An algorithm for ensemble feature selection with the contextual merit-based heuristic was considered in detail in [18]. In this paper we extend it to the correlation-based heuristic, too. For the two heuristics the algorithm constructs an ensemble of a fixed number of base classifiers over selected feature subsets. The number of base classifiers in both ensembles is the number of different classes among the instances of the training set. The objective is to build each base classifier on the feature subset including the features most relevant for distinguishing the corresponding class from the others. It is necessary to note that in this algorithm the base classifiers are still not binary, but distinguish between all the classes present in the data set. Each base classifier of the initial ensemble is based on a fixed number of features with the highest value of the CM measure for the corresponding class, or with the highest correlation-based measure correspondingly. The initial ensemble is iteratively modified by trying to change the number of features one by one for the less diverse base classifiers, suggesting exclusions or inclusions of features. With the correlation-based measure, candidate feature subsets for new base classifiers are formed using a forward inclusion procedure. The iterations are guided by the diversity of the classifiers and the value of the CM measure or the correlation-based merit measure, depending on the heuristic used.

For our algorithm we use the diversity measure calculated as an average difference 
in predictions between all pairs of classifiers. The diversity is calculated similarly as 
in [17] where it was used as a measure of independence of the base classifiers. The 
modified formula to calculate the approximated diversity Div_i of a classifier h_i is (9).

$$Div_i = \frac{\sum_{j=1}^{M} \sum_{k=1, k \neq i}^{S} Dif(h_i(x_j), h_k(x_j))}{M \cdot (S-1)} \qquad (9)$$

where S denotes the number of the base classifiers, h_i(x_j) denotes the classification of
the vector x_j by the classifier h_i, Dif(a,b) is zero if the classifications a and b are
the same and one if they are different, and M is the number of instances in the test set.
The diversity of an ensemble is calculated as the average diversity of all the base
classifiers (10).

$$Div = \frac{\sum_{i=1}^{S} Div_i}{S} \qquad (10)$$
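The two diversity formulas can be illustrated with a small sketch. It assumes the predictions of the base classifiers are already available as a list of label lists; the function names `classifier_diversities` and `ensemble_diversity` and the toy predictions are hypothetical and are not taken from the paper.

```python
def classifier_diversities(predictions):
    """Div_i for every base classifier, following formula (9).

    predictions[i][j] is the class label that base classifier i
    assigns to instance j (S classifiers, M instances).
    """
    S = len(predictions)
    M = len(predictions[0])
    divs = []
    for i in range(S):
        # Count pairwise disagreements of classifier i with every other classifier.
        disagreements = sum(
            1
            for k in range(S) if k != i
            for j in range(M)
            if predictions[i][j] != predictions[k][j]
        )
        divs.append(disagreements / (M * (S - 1)))
    return divs


def ensemble_diversity(predictions):
    """Div of the ensemble, following formula (10): the mean of all Div_i."""
    divs = classifier_diversities(predictions)
    return sum(divs) / len(divs)


# Toy predictions of three classifiers on four instances (illustrative only).
preds = [
    ["a", "a", "b", "b"],
    ["a", "b", "b", "b"],
    ["a", "a", "a", "b"],
]
divs = classifier_diversities(preds)
div = ensemble_diversity(preds)
```

In this toy case the first classifier disagrees with the others on 2 of the 8 compared predictions, so its diversity is 0.25, while the other two each reach 0.375.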




Ensemble Feature Selection Based on Contextual Merit and Correlation Heuristics 






The CM measure is calculated as in [9] and described in Sect. 3.1. The correlation 
measure is calculated as described in Sect. 3.2. 

Our algorithm is composed of two main phases: 1) the construction of the initial
ensemble and 2) the iterative development of new candidate ensembles. Let DS be a
data set including instances {(x_1, y_1), ..., (x_M, y_M)}, where x_i = (x_{i1}, ..., x_{iN}) is the
vector including the values of each of the N features. The main outline of our
algorithm is presented below.

Algorithm EFS_ref(DS)
  DS         the whole data set
  TRS        training set
  VS         validation set used during the iterations
  TS         test set
  Ccurr      the current ensemble of base classifiers
  Accu       the accuracy of the final ensemble
  FS         set of feature subsets for base classifiers
  Threshold  threshold value used to select the features for the initial ensemble
begin
  divide_instances(DS, TRS, VS, TS)  {divides DS into TRS, VS, and TS using stratified random sampling}
  for Heuristic in {CM, Corr}
    Ccurr = build_initial_ensemble(Heuristic, TRS, FS, Threshold)
    loop
      cycle(Heuristic, TRS, Ccurr, FS, VS)  {develops candidate ensembles and updates Accu, Ccurr, and FS when necessary}
    until no_changes
    Accu = accuracy(TS, Ccurr)
  end for
end algorithm EFS_ref



In the EFS_ref algorithm ensembles are generated in two places: the initial ensembles
are generated by a different procedure than the new candidate ensembles, which are
generated inside the procedure cycle. One of the two initial ensembles is constructed
using CM-based merit values and the other one using correlation-based merit values.
In all cases the features with the highest normalized merit values up to the threshold
are selected and the base classifiers are built using the C4.5 learning algorithm. The
threshold is fixed in advance (∈ {0.1, 0.2, ..., 0.9}) so that the accuracy of the ensemble
over the training set is the highest one with the weighted voting. The main outline of the
initial ensemble algorithm is presented below. The threshold value is used to cut the
interval of merit values in order to select the features. For example, the threshold value
0.1 means that only those features whose merit values are in the highest 10% of the
whole interval of the merit values are selected.
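The threshold rule can be sketched as follows. This is only one reading of the description above: it assumes the "interval of merit values" runs from the smallest to the largest merit of the class, and that a feature is kept when its merit lies in the upper `threshold` fraction of that interval. The function name mirrors `select_features` from the outline, but its signature and the toy merit values are assumptions.

```python
def select_features(merits, threshold):
    """Return the indices of features whose merit lies in the highest
    `threshold` fraction of the interval [min(merits), max(merits)].

    Assumption: the cut is taken on the value interval, not on the
    number of features.
    """
    lo, hi = min(merits), max(merits)
    cutoff = hi - threshold * (hi - lo)
    return [f for f, m in enumerate(merits) if m >= cutoff]


# Toy merit values of five features for one class (illustrative only).
merits = [0.10, 0.95, 0.50, 0.80, 1.00]
top_10_percent = select_features(merits, 0.1)   # cutoff is about 0.91
top_50_percent = select_features(merits, 0.5)   # cutoff is about 0.55
```

With these toy values the threshold 0.1 keeps features 1 and 4, and the threshold 0.5 additionally keeps feature 3.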

build_initial_ensemble(Heuristic, TRS, FS, Threshold)
  L  number of classes and number of base classifiers
begin
  Ensemble = ∅; FS = ∅




162 S. Puuronen, I. Skrypnyk, and A. Tsymbal 



  for i from 1 to L
    MERITS[i] = calculate_merits(Heuristic, TRS, i)  {feature merits for class i}
    FS[i] = select_features(MERITS[i], Threshold)    {selects features with the highest merits using threshold}
    C[i] = C4.5(TRS, FS[i])                          {learns classifier i}
    Ensemble = Ensemble ∪ C[i]; FS = FS ∪ FS[i]
  end for
end build_initial_ensemble = Ensemble



The iteration mechanism used in this paper changes one base classifier in each
iteration. The base classifier with the smallest diversity is taken as the candidate
to be changed. For this classifier, an attempt is made to add one feature to, or delete
one feature from, the subset of features that was used to train it. The feature is
selected using the merit value. The outline of the iteration algorithm is described
below.

cycle(Heuristic, TRS, Ccurr, FS, VS)
begin
  loop  {first loop: tries to delete one feature}
    for i from 1 to L
      DIV[i] = calculate_diversities(TRS, Ccurr)
    end for
    Cmin = argmin DIV[i]
    NewCcurr = Ccurr \ C[Cmin]; NewFS = FS
    MERITS[Cmin] = calculate_merits(Heuristic, TRS, Cmin)
    NewFS[Cmin] = FS[Cmin] \ feature with min MERITS[Cmin] included in FS[Cmin]
    NewCcurr = NewCcurr ∪ C4.5(TRS, NewFS[Cmin])
    if accuracy(VS, NewCcurr) >= accuracy(VS, Ccurr)
      then Ccurr = NewCcurr; FS = NewFS
      else no changes to Ccurr and FS
  until no change
  loop  {second loop: tries to add one feature}
    for i from 1 to L
      DIV[i] = calculate_diversities(TRS, Ccurr)
    end for
    Cmin = argmin DIV[i]
    NewCcurr = Ccurr \ C[Cmin]; NewFS = FS
    MERITS[Cmin] = calculate_merits(Heuristic, TRS, Cmin)
    NewFS[Cmin] = FS[Cmin] ∪ feature with max MERITS[Cmin] not included in FS[Cmin]
    NewCcurr = NewCcurr ∪ C4.5(TRS, NewFS[Cmin])
    if accuracy(VS, NewCcurr) >= accuracy(VS, Ccurr)
      then Ccurr = NewCcurr; FS = NewFS
      else no changes to Ccurr and FS
  until no change
end cycle



The iteration mechanism is composed of two loops. Both loops try to replace one
base classifier at a time. The base classifier that a loop tries to replace is the one
with the lowest diversity value. When the classifier to be changed has been selected,
one feature is either deleted from (first loop) or added to (second loop) the subset of
features that was used to train the classifier. The feature suggested for addition or
deletion is selected using the merit value of the corresponding heuristic. The accuracy of
the previous ensemble is compared with the accuracy of the changed ensemble and the
change is accepted if the first one is not higher. Both loops end when no changes were
accepted during the whole cycle.
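The following sketch shows one of the two loops under simplifying assumptions. Training and evaluation are abstracted into two caller-supplied functions, `diversity_of` and `score`, which are hypothetical stand-ins for the diversity calculation and for the validation accuracy of a C4.5 ensemble trained on the given feature subsets. The guard that stops the loop when no feature can be deleted or added is also an assumption, since the outline does not state what happens in that case.

```python
def refine(subsets, merits, diversity_of, score, mode):
    """Run one loop of the refinement cycle and return the refined subsets.

    subsets      -- list of feature-index sets, one per base classifier
    merits       -- merits[c][f] is the merit of feature f for classifier c
    diversity_of -- function: subsets -> list of diversities (assumed helper)
    score        -- function: subsets -> validation accuracy (assumed helper)
    mode         -- "delete" (first loop) or "add" (second loop)
    """
    subsets = [set(s) for s in subsets]
    changed = True
    while changed:
        changed = False
        divs = diversity_of(subsets)
        c = min(range(len(subsets)), key=divs.__getitem__)  # least diverse classifier
        if mode == "delete":
            if len(subsets[c]) <= 1:
                break  # assumed guard: never empty a feature subset
            f = min(subsets[c], key=lambda f: merits[c][f])
            new_subset = subsets[c] - {f}
        else:
            candidates = set(range(len(merits[c]))) - subsets[c]
            if not candidates:
                break  # assumed guard: nothing left to add
            f = max(candidates, key=lambda f: merits[c][f])
            new_subset = subsets[c] | {f}
        trial = subsets[:c] + [new_subset] + subsets[c + 1:]
        # Accept the change if the accuracy of the previous ensemble is not higher.
        if score(trial) >= score(subsets):
            subsets = trial
            changed = True
    return subsets


# Toy stand-ins (illustrative only): features 0 and 1 are "useful".
USEFUL = {0, 1}
toy_score = lambda subs: sum(len(s & USEFUL) - 0.5 * len(s - USEFUL) for s in subs)
toy_diversity = lambda subs: [-len(s) for s in subs]
toy_merits = [[0.9, 0.1, 0.2, 0.0], [0.1, 0.9, 0.0, 0.3]]

after_delete = refine([{0, 2}, {1, 3}], toy_merits, toy_diversity, toy_score, "delete")
after_add = refine(after_delete, toy_merits, toy_diversity, toy_score, "add")
```

In the toy run the deletion loop strips the two low-merit features, and the addition loop then rejects its first candidate and stops, which mirrors the "until no change" condition.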



5 Experiments 



In this section, experiments with our algorithm for generating an ensemble of
classifiers built on different feature subsets are presented. First, the experimental
setting is described, and then the results of the experiments are presented. The
experiments are conducted on ten data sets taken from the UCI machine learning repository
[13]. These data sets were chosen so as to provide a variety of application areas, sizes,
combinations of feature types, and difficulty as measured by the accuracy achieved
on them by current algorithms.

For each data set 30 test runs are made. In each run the data set is first split into the 
training set and two test sets by stratified random sampling keeping the class distribu- 
tion of instances in each set approximately the same as in the initial data set. The 
training set (TRS) includes 60 percent of instances and the test sets (VS and TS) both 
20 percent of instances. The first test set, VS (validation set) is used for tuning the 
ensemble of classifiers, adjusting the initial feature subsets so that the ensemble accu- 
racy becomes as high as possible using the selected heuristic. The other test set, TS is 
used for the final estimation of the ensemble accuracy. The base classifiers them- 
selves are learnt using the C4.5 decision tree algorithm with pruning [19] and the test 
environment was implemented within the MLC++ framework [11]. 
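The 60/20/20 stratified split described above can be sketched in a few lines. This is not the MLC++ procedure used in the experiments; the function name `stratified_split`, the rounding of the per-class counts, and the fixed seed are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split instance indices into TRS, VS and TS so that each set keeps
    approximately the class distribution of the whole data set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    trs, vs, ts = [], [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)  # random sampling within each class
        n_trs = round(fractions[0] * len(idxs))
        n_vs = round(fractions[1] * len(idxs))
        trs += idxs[:n_trs]
        vs += idxs[n_trs:n_trs + n_vs]
        ts += idxs[n_trs + n_vs:]  # the remainder goes to the final test set
    return trs, vs, ts


# Toy class labels: 10 instances of class "a" and 20 of class "b".
labels = ["a"] * 10 + ["b"] * 20
trs, vs, ts = stratified_split(labels)
```

Repeating the call with 30 different seeds would correspond to the 30 test runs, under the assumption that each run draws an independent split.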

Our aim is to analyze the contribution of the iterative refinement cycle for the CM-
and correlation-based ensembles, and then to compare their accuracy with the C4.5
accuracy. We also examine which of the two feature selection heuristics produces more
diverse and more accurate ensembles.



Table 1. Accuracies (%) of the CM-based and correlation-based ensembles, and C4.5 



Data set   CM-based                      Correlation-based             C4.5
           before   after   threshold    before   after   threshold
Car        86.6     87.9    0.5          86.0     86.8    0.6          87.9
Glass      64.8     65.2    0.1          65.7     66.0    0.3          62.6
Iris       94.6     94.3    0.9          94.1     94.2    0.7          93.9
LED_17     64.4     64.6    0.1          59.5     62.1    0.4          65.0
Lymph      76.4     75.3    0.5          74.0     73.6    0.6          74.2
Thyroid    91.4     92.7    0.6          93.4     93.1    0.3          92.8
Wine       93.5     93.5    0.2          92.3     93.4    0.7          92.9
Waveform   73.0     74.2    0.2          74.3     74.2    0.5          72.2
Vehicle    67.0     67.3    0.1          67.6     68.1    0.5          68.7
Zoo        92.2     93.5    0.1          87.0     88.1    0.5          92.4



We collected accuracies (Table 1) and diversities (Table 2) for these two ensembles
on each data set before and after the iterations. The corresponding entries are marked
CM_b and CM_a for the CM-based ensemble, and CR_b and CR_a for the correlation-based
ensemble. The thresholds used for each data set are presented in the corresponding
columns for the CM-based and correlation-based ensembles.



Table 2. Diversities of the base classifiers before and after iterations for the CM-based and 
correlation-based ensembles 



Data set        Average diversity of the base classifiers           Diversity  Features  Aver. num. of Aver. num. of
                (before and after iterations)                       diff.                select. feat. iterations
Car       CM_b  12.6 12.3 15.9 16.2                                 7.3        5         4.225         1.5
          CM_a  24.5 20.5 20.2 21.1
          CR_b  13.6 13.6 17.2 16.8                                 6.925                3.217         1.6
          CR_a  27.3 23.5 19.3 18.8
Glass     CM_b  20.8 21.7 24.1 28.2 35.6 32.0                       7.5        9         4.611         1.367
          CM_a  33.0 31.6 32.1 35.1 41.4 36.2
          CR_b  30.4 44.8 31.4 31.3 40.5 28.8                       3.52                 3.328         1.3
          CR_a  35.6 46.1 35.1 35.3 42.0 34.2
Iris      CM_b   3.6  1.8  1.9                                      1.83       4         1.356         1.067
          CM_a   5.7  3.3  3.8
          CR_b  22.3 44.4 22.3                                      1.23                 1.678         1
          CR_a  24.2 45.2 23.3
LED_17    CM_b  24.6 39.8 34.5 24.2 24.2 21.8 23.9 30.6 34.2 21.4   5.81       24        22.367        1.5
          CM_a  31.3 44.0 36.5 31.7 32.7 30.5 30.5 35.4 37.0 27.7
          CR_b  58.3 59.5 73.5 57.7 82.2 66.6 62.3 63.9 55.0 54.1   1.35                 4.533         1.6
          CR_a  60.3 60.6 73.9 60.5 81.8 68.5 62.7 66.2 56.7 55.2
Lymph     CM_b  18.8 15.1 17.7 16.3                                 7.75       18        6.625         1.6
          CM_a  28.3 22.2 24.0 24.4
          CR_b  25.1 21.4 21.0 38.2                                 2.475                5.35          1.267
          CR_a  27.9 24.6 24.2 38.9
Thyroid   CM_b   6.2  6.4  9.3                                      2.17       5         1.844         1.4
          CM_a   9.3  8.5 10.6
          CR_b  14.7 12.4 11.8                                      1.2                  2.944         1.2
          CR_a  15.8 13.8 12.9
Vehicle   CM_b  24.6 23.1 24.6 24.6                                 3.875      18        12.258        1.5
          CM_a  29.8 28.7 27.1 27.8
          CR_b  39.1 43.0 45.5 43.5                                 0.85                 6.492         1.433
          CR_a  40.4 43.7 46.0 44.4
Wine      CM_b  16.7 15.7 17.5                                      1.43       12        6.7           1.3
          CM_a  18.8 17.7 17.8
          CR_b  16.2 18.9 16.6                                      3.07                 3.944         1.3
          CR_a  19.7 21.8 19.4
Waveform  CM_b  18.8 21.5 19.7                                      3.67       21        17.611        1.267
          CM_a  22.8 24.4 23.8
          CR_b  40.3 39.0 38.8                                      1.33                 8.289         1.467
          CR_a  41.4 40.0 40.1
Zoo       CM_b  12.6 13.2 12.3 13.9 14.2 13.8 11.9                  2.57       16        13.252        1.2
          CM_a  15.7 15.9 14.9 16.3 17.0 15.3 14.8
          CR_b  37.5 32.9 37.1 44.0 40.3 34.1 30.2                  2.66                 3.867         1.367
          CR_a  39.5 36.1 39.8 44.3 42.2 36.9 35.9

Table 2 presents, for each data set, the diversities of all base classifiers; they are
listed in the sub-columns of the second column. The number of base classifiers is equal
to the number of classes for each data set. The third column (diversity diff.) records
the average difference of diversity before and after iteration. The fourth column
(features) gives the number of features in the corresponding data set, and the fifth
column shows the average number of features used to construct the base classifiers. The
last column presents the average number of iterations in the refinement cycle.

To draw conclusions in line with our aims from the tables above, we calculated the
statistics in Table 3. We applied the paired sign test in order to estimate the
number of wins, losses, and ties for each pair of compared classification results, and
the paired t-test in order to conclude whether the difference was statistically
significant. The paired t-test is stricter than the sign test, but it assumes that the
population follows the Gaussian distribution. We calculated statistics for both tests
over 30 runs, and tested whether the distribution is normal using standardized skewness
and standardized kurtosis (see footnote 2). The second and third columns in Table 3
represent statistics for the comparison of the two heuristic-based ensembles before and
after iterations. The fourth and fifth columns summarize the difference in accuracy
before and after iterations for the CM-based and correlation-based ensembles. The sixth
and seventh columns represent statistics for the comparison of both heuristic-based
ensembles after the refinement cycle versus the C4.5 algorithm. The sub-columns indicate
the trial for the sign test and the P value for the paired t-test. Negative results of
the normal distribution test are outlined. Both statistical tests have been done at the
95% confidence level. Average calculations in the last row of the table have been done
over the ten data sets using their average accuracy from Table 1.



Table 3. Comparisons of the accuracies obtained by CM-based and correlation-based ensem- 
bles, and the accuracies of C4.5 by the sign test and the paired t-test 



(w,t,l: win, tie, loss counts of the sign test; P: P value of the paired t-test; 1-t.: one-tailed; 2-t.: two-tailed)

Data set  CM_b vs CR_b      CM_a vs CR_a      CM_a vs CM_b      CR_a vs CR_b      CM_a vs C4.5      CR_a vs C4.5
          w,t,l    P 2-t.   w,t,l    P 2-t.   w,t,l    P 1-t.   w,t,l    P 1-t.   w,t,l    P 2-t.   w,t,l    P 2-t.
Car       16,2,12  0.1052   19,3,8   0.0136   20,5,5   0.0001   20,4,6   0.0013   11,6,13  1.0000   5,5,20   0.0009
Glass     10,5,15  0.5249   12,5,13  0.4998   9,14,7   0.2933   12,14,4  0.3048   19,6,5   0.0156   21,3,6   0.0021
Iris      6,19,5   0.3219   5,19,6   0.8455   2,26,2   0.2244   3,25,2   0.1862   6,21,3   0.3544   5,23,2   0.2591
LED_17    19,7,4   <0.0001  16,4,10  0.0376   8,14,8   0.4083   19,9,2   <0.0001  11,5,14  0.4872   7,1,22   0.0033
Lymph     14,7,9   0.0876   14,6,10  0.1152   8,12,10  0.1669   6,18,6   0.2930   16,4,10  0.2833   8,11,11  0.4999
Thyroid   7,6,17   0.0018   7,16,7   0.4171   16,11,3  0.0011   6,19,5   0.2339   10,12,8  0.7375   11,13,6  0.5634
Vehicle   14,2,14  0.4682   12,2,16  0.2175   14,7,9   0.2106   12,10,8  0.1732   8,4,18   0.0138   13,2,15  0.3385
Waveform  12,0,18  0.0524   13,0,17  0.9717   18,8,4   0.0015   9,13,8   0.6606   23,2,5   0.0003   23,1,6   0.0008
Wine      13,8,9   0.2080   12,6,12  0.9250   4,18,8   0.4976   8,18,4   0.0655   11,10,9  0.3918   13,7,10  0.4832
Zoo       26,2,2   <0.0001  25,3,2   <0.0001  9,18,3   0.0290   12,14,4  0.0814   8,17,5   0.1653   2,6,22   <0.000
Average   6,0,4    0.2356   6,1,3    0.1760   7,1,2    0.0898   7,0,3    0.1719   6,1,3    0.1457   5,0,5    0.6777


The trial "win, tie, loss", for example in the comparison of the CM-based ensemble before
and after iterations (CM_a versus CM_b) for the Car data set, means that in 20 cases of 30
CM_a has higher accuracy, in 5 cases the accuracies are comparable, and in 5 cases
accuracy was higher for CM_b. A small P value means that it is unlikely that the effect
of iterations is due to a coincidence of random sampling of 30 runs from the overall
population. If the P value is < 0.05, the observed effect is statistically significant.
A large P value does not give us any reason to conclude that an effect is present, since
we cannot observe it on the current data sample. To compare the effect of iterations,
the one-tailed t-test is used, since we can assume that accuracy after refinement tends
to be higher than before refinement. The corresponding P value rejects or accepts the
hypothesis about the effect of the cycle. When comparing the heuristic-based ensembles
versus C4.5, as well as the CM-based ensemble versus the correlation-based one, we cannot
predict the accuracy trend in advance. That is why for these cases the two-tailed P value
is used, indicating the presence of a statistically significant difference in accuracy.
Which algorithm performs better can be seen from the sign test trial.

2 Values of these statistics outside the range of -2 to +2 indicate significant abnormality,
which would tend to invalidate any statistical test regarding the standard deviation, and the
paired t-test in our case.
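A sketch of how such win, tie, loss trials and the paired t statistic could be computed from per-run accuracies is given below. The function names and the tolerance parameter `tol` (what counts as "comparable") are assumptions. The exact two-sided sign-test P value is an addition for illustration, since the paper reports only the counts for the sign test. Turning the t statistic into a P value needs the t distribution, which the Python standard library does not provide, so that step is left out.

```python
from math import comb, sqrt
from statistics import mean, stdev


def win_tie_loss(acc_a, acc_b, tol=0.0):
    """Count runs in which A is better, comparable, or worse than B."""
    wins = sum(a - b > tol for a, b in zip(acc_a, acc_b))
    losses = sum(b - a > tol for a, b in zip(acc_a, acc_b))
    return wins, len(acc_a) - wins - losses, losses


def sign_test_p(wins, losses):
    """Exact two-sided sign test on the non-tied runs (ties are discarded)."""
    n = wins + losses
    k = max(wins, losses)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def paired_t_statistic(acc_a, acc_b):
    """t statistic of the paired t-test with len(acc_a) - 1 degrees of freedom."""
    d = [a - b for a, b in zip(acc_a, acc_b)]
    return mean(d) / (stdev(d) / sqrt(len(d)))


# Counts taken from Table 3 (Car, CM_a versus CM_b): 20 wins, 5 ties, 5 losses.
p_car = sign_test_p(20, 5)

# Toy accuracies of two methods over three runs (illustrative only).
trial = win_tie_loss([0.9, 0.8, 0.7], [0.8, 0.8, 0.9])
t_stat = paired_t_statistic([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
```

For the Car counts the sign-test P value is about 0.004, which points in the same direction as the one-tailed t-test result reported in Table 3 for that comparison.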

First, let us compare the two heuristic-based ensembles. Over the ten data sets, CM_b
outperforms CR_b on 6 according to the sign test. The benefits of CM_b are confirmed
as statistically significant for LED_17 and Zoo. CM_b is slightly better than
CR_b on Lymph, Wine, and Car. However, according to the t-test, the superiority of CR_b
is statistically significant for Thyroid. On Glass and Waveform CR_b is slightly
better as well (see footnote 3). Both the initial ensembles (CM_b and CR_b) and the
refined ensembles (CM_a and CR_a) reached their highest accuracy on the Iris data set
and their lowest accuracy on the LED_17 data set.

After iterations CM_a still outperforms CR_a in 6 cases, in one case they are
comparable, and in 3 cases CR_a is better than CM_a. CM_a is significantly more
accurate than CR_a for Car, LED_17 and Zoo, and slightly more accurate on the
Lymph data set. CM_b and CM_a are almost comparable on the Wine data set.

Thus, we can conclude that CM-based ensembles are more accurate than correlation-based
ensembles. However, in most cases CR_b produces more diverse initial classifiers,
especially for Glass, Iris, LED_17, Thyroid, Vehicle, Waveform, and Zoo. The diversity
of CM_b and CR_b is almost comparable for the Wine data set only. After iterations the
diversities even out for Glass, Lymph, and Wine. In general, the iterations resulted in
a larger increase in diversity for the CM-based ensemble. The advantage of the
correlation-based ensemble is the small number of features on which the base
classifiers are constructed. This is clearly seen for the LED_17, Vehicle, Wine,
Waveform, and Zoo data sets. The number of iterations is another important item for
comparison. We can observe that for 5 data sets the CM-based ensemble made, on average,
more iterations, for 1 data set the average numbers of iterations are equal, and on
4 data sets the correlation-based ensemble made more iterations.

Let us now analyze the contribution of the iterative refinement cycle, which is intended
to increase diversity and accuracy in the ensemble. Diversity was increased over all data
sets by 1-7%. Accuracy was increased with both heuristic-based ensembles by approximately
1-2%. For the CM-based ensemble the effect of iterations is statistically significant
on Car, Thyroid, Waveform, and Zoo. Sometimes overfitting can take place (see footnote 4),
as in this case for Iris and Lymph, and with correlation-based iteration for the Thyroid
and Waveform data sets. The improvement in accuracy for the correlation-based ensemble is
statistically significant for the Car and LED_17 data sets. For the other data sets we
cannot conclude that there is no effect of iterations; it is just not observed on the
current data.

3 The situation may differ slightly from Table 1, inasmuch as that table includes
accuracies averaged over 30 runs.

4 The accuracies are calculated using the evaluation set, and if the iteration results in
overfitting with respect to the test set used during iteration, then the final accuracy
can become smaller.

Finally, we compare the heuristic-based ensembles with the C4.5 learning algorithm.
The heuristic-based ensembles use far fewer features to construct classifiers, as
shown in Table 2. According to the P values in Table 3, CM_a is significantly
better than C4.5 for Glass and Waveform, and C4.5 is significantly better on Vehicle.
The sign test over all data sets sums up that CM_a works better in 6 cases, worse in
3 cases, and ties in 1 case. The sign test for CR_a shows that CR_a and C4.5 are
comparable. CR_a is significantly better for the Glass and Waveform data sets, whereas
C4.5 is significantly better for the Car, LED_17 and Zoo data sets.



6 Conclusions 

Ensembles of classifiers can be constructed by a number of methods with the purpose 
of creating a set of diverse and accurate base classifiers. Feature selection techniques, 
along with other techniques are applied to prepare the training sets for construction of 
the base classifiers. In this paper, we have analyzed and experimented with two fea- 
ture selection heuristics. Both the CM-based and the correlation-based ones rely on 
quite similar assumptions concerning heterogeneous classification problems. We 
produced ensembles including as many base classifiers as there are classes and each 
base classifier was produced by C4.5 to distinguish instances of one class from the 
other classes. Each classifier is based on a subset of features and these features are 
selected using the CM-based or correlation-based merit values. 

In order to refine the ensemble characteristics, we applied iterative refinement during
the final ensemble generation process. The refinement cycle provides feedback
that promotes a more diverse set of base classifiers while taking global accuracy into
account.

We have evaluated our approach on a number of data sets from the UCI machine
learning repository. Experiments showed that the CM-based approach often outperformed
the correlation-based approach in accuracy; however, the latter in most cases produces
more diverse classifiers. Iterations usually have a greater effect with the CM-based
approach, making the difference in diversity smaller at the end. The correlation-based
approach has the advantage of producing simpler base classifiers than the CM-based one
because of the small number of features used. The iterative refinement cycle
increases diversity for the CM-based approach more effectively than for the
correlation-based one. Since iterations promote more diverse classifiers, it seems that
such a refinement is preferable for a heuristic producing base classifiers of small
diversity. Further research is also needed to find a more beneficial iterative
refinement for the correlation-based approach. The CM-based ensemble in many cases works
better than C4.5. The correlation-based ensemble is comparable in accuracy with C4.5,
at least with the iterative refinement currently used.







References 

1. Apte, C., Hong, S.J., Hosking, J.R.M., Lepre, J., Pednault, E.P.D., Rosen, B.K.:
Decomposition of heterogeneous classification problems. Advances in Intelligent Data
Analysis, Springer-Verlag, London (1997) 17-28.

2. Battiti, R., Colla, A.M.: Democracy in neural nets: voting schemes for classification.
Neural Networks, Vol. 7, No. 4 (1994) 691-707.

3. Bauer, E., Kohavi, R.: An empirical comparison of voting classification algorithms:
bagging, boosting, and variants. Machine Learning, Vol. 36 (1999) 105-139.

4. Dash, M., Liu, H.: Feature selection for classification. Intelligent Data Analysis, Vol. 1,
No. 3, Elsevier Science (1997).

5. Dietterich, T.: Machine learning research: four current directions. Artificial Intelligence,
Vol. 18, No. 4 (1997) 97-136.

6. Fan, W., Stolfo, S., Chan, P.: Using conflicts among base classifiers to measure the
performance of stacking. In: Proc. 16th Int. Conf. on Machine Learning (ICML'99) Workshop
on Recent Advances in Meta-learning and Future Work, (1999) 10-17.

7. Hall, M.: Correlation-based feature selection for discrete and numeric class machine
learning. In: Proc. 17th Int. Conf. on Machine Learning, Stanford University, CA, Morgan
Kaufmann Publishers (2000). [http://www.iit.nrc.ca/bibliographies/feature-selection.html]

8. Hansen, L., Salamon, P.: Neural network ensembles. In: IEEE Transactions on Pattern
Analysis and Machine Intelligence, Vol. 12 (1990) 993-1001.

9. Hong, S.J.: Use of contextual information for feature ranking and discretization. IEEE
Transactions on Knowledge and Data Engineering, Vol. 9, No. 5 (1997) 718-730.

10. Howe, N., Cardie, C.: Examining locally varying weights for nearest neighbor algorithms.
Lecture Notes in Artificial Intelligence, Springer (1997) 455-466.

11. Kohavi, R., Sommerfield, D., Dougherty, J.: Data mining using MLC++: a machine learning
library in C++. Tools with Artificial Intelligence. IEEE CS Press (1996) 234-245.

12. Kononenko, I., Simec, E., Robnik, M.: Overcoming the myopia of inductive learning
algorithms with RELIEF. Applied Intelligence, Vol. 7 (1997) 39-55.

13. Merz, C.J., Murphy, P.M.: UCI Repository of Machine Learning Data Sets
[http://www.ics.uci.edu/~mlearn/MLRepository.html]. Dep-t of Information and CS, Un-ty
of California, Irvine, CA (1998).

14. Opitz, D., Maclin, R.: Popular ensemble methods: an empirical study. Artificial Intelligent
Research, Vol. 11 (1999) 169-198.

15. Opitz, D.: Feature selection for ensembles. In: 16th National Conf. on Artificial Intelligence
(AAAI), Orlando, Florida (1999) 379-384.

16. Oza, N., Turner, K.: Dimensionality Reduction Through Classifier Ensembles. Tech. Rep.
NASA-ARC-IC-1999-126.

17. Prodromidis, A.L., Stolfo, S.J., Chan, P.K.: Pruning classifiers in a distributed
meta-learning system. In: Proc. 1st National Conference on New Information Technologies,
(1998) 151-160.

18. Puuronen, S., Skrypnyk, I., Tsymbal, A.: Ensemble feature selection based on the
contextual merit. In: Proc. 3rd Int. Conf. on Data Warehousing and Knowledge Discovery
(DaWaK'01), (2001) (to appear).

19. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo,
California (1993).

20. Skrypnyk, I., Tsymbal, A., Puuronen, S.: Local feature selection for heterogeneous
problems. In: Proc. 2nd Int. Conf. on Data Mining 2000, WIT Press (2000) 203-212.

21. Turner, K., Ghosh, J.: Error correlation and error reduction in ensemble classifiers.
Connection Science, Vol. 8, Nos. 3,4 (1996) 385-404.




Interactive Constraint-Based Sequential Pattern Mining*



Marek Wojciechowski 



Poznan University of Technology 
Institute of Computing Science 
ul. Piotrowo 3a, 60-965 Poznan, Poland 
Marek.Wojciechowski@cs.put.poznan.pl



Abstract. Data mining is an interactive and iterative process. It is very 
likely that a user will execute a series of similar queries differing in pat- 
tern constraints and mining parameters, before he or she gets satisfying 
results. Unfortunately, data mining algorithms currently available suffer 
from long processing times, which is unacceptable in case of interactive 
mining. In this paper we discuss efficient processing of sequential pat- 
tern queries utilizing cached results of other sequential pattern queries. 
We analyze differences between sequential pattern queries and propose 
algorithms that in many cases can be used instead of time-consuming 
mining algorithms. 



1 Introduction 

Data mining aims at discovery of useful patterns from large databases or ware- 
houses. One of the most popular data mining methods is sequential pattern dis- 
covery introduced in [2]. Informally, sequential patterns are the most frequently 
occurring subsequences in sequences of sets of items. The initial formulation of 
the problem was significantly extended in [10], where a taxonomy on items was 
added to support discovery of so called generalized sequential patterns, and three 
time constraints (min-gap, max-gap, and time window) were introduced to be 
used when checking if a given source sequence contains a given pattern. For that 
extended problem formulation, an efficient algorithm called GSP was proposed. 
Applications of sequential patterns include analysis of telecommunication sys- 
tems, discovering frequent buying patterns, analysis of patients’ medical records, 
etc. 

From a user’s point of view, data mining can be seen as an interactive and
iterative process of advanced querying: a user specifies the source dataset and
the requested class of patterns, the system chooses the appropriate data mining
algorithm and returns discovered patterns to the user [4] [6]. A user interacting
with a data mining system has to specify several constraints on patterns to be
discovered. However, usually it is not trivial to find a set of constraints leading
to the satisfying set of patterns. Thus, users are very likely to execute a series of
similar data mining queries before they find what they need. Unfortunately, data
mining algorithms require long processing times, which makes such interaction
difficult.

* This work was partially supported by the grant no. KBN 43-1309 from the State
Committee for Scientific Research (KBN), Poland.

A. Caplinskas and J. Eder (Eds.): ADBIS 2001, LNCS 2151, pp. 169-181, 2001.
(c) Springer-Verlag Berlin Heidelberg 2001

In this paper, we discuss efficient sequential pattern discovery in the presence 
of materialized results of previous sequential pattern queries. We claim that a 
data mining system should exploit the fact that a user is very likely to execute 
a number of similar sequential pattern queries during a single session. We pro- 
pose caching results of mining queries by materializing their results on disk (we 
assume that a data mining system is going to be assigned a certain amount of 
disk space for that purpose). It is obvious that materialized results of a query 
can be used to answer an identical query, therefore we concentrate on processing 
queries different from those whose results are available. The possibility of an- 
swering a query using known results of another query depends on the differences 
between the two queries. Our goal is to provide criteria for determining if cached 
results of a given query can be used to answer the current query without running 
a complete mining algorithm, and introduce efficient sequential pattern query 
processing algorithms exploiting materialized patterns. 

Exploiting cached results of previous mining queries has been studied in 
the context of association rules [3] [7]. However, direct application of methods 
and techniques introduced for association rules to sequential pattern discovery 
problem is not possible since different types of constraints are available in the two 
problems. Nevertheless, it seems that the general ideas should stay unchanged. 

It has been observed [3] that three particularly interesting relationships
between two mining queries DMQ_1 and DMQ_2 extracting patterns from the
same data are equivalence, inclusion, and dominance. These relationships
are interesting since they represent situations where one data mining query can
be efficiently answered using the results of another query. Differences between
mining queries leading to these relationships were analyzed only in the context
of association rules. In this paper we present an analogous analysis concerning
sequential patterns. Thus, most of our work can be regarded as an extension of
the approach from [3] to sequential pattern discovery.

1.1 Sequential Patterns 

Let L = {l_1, l_2, ..., l_m} be a set of literals called items. An itemset is a non-empty
set of items. A sequence is an ordered list of itemsets and is denoted as
⟨X_1 X_2 ... X_n⟩, where X_i is an itemset (X_i ⊆ L). X_i is called an element of the
sequence. The size of a sequence is the number of items in the sequence. The
length of a sequence is the number of elements in the sequence. Let D be a
set of variable length sequences (called data-sequences), where for each sequence
S = ⟨X_1 X_2 ... X_n⟩, a timestamp is associated with each X_i.

With no time constraints we say that a sequence X = ⟨X_1 X_2 ... X_n⟩ is
contained in a data-sequence Y = ⟨Y_1 Y_2 ... Y_m⟩ if there exist integers
i_1 < i_2 < ... < i_n such that X_1 ⊆ Y_{i_1}, X_2 ⊆ Y_{i_2}, ..., X_n ⊆ Y_{i_n}. We call
⟨Y_{i_1} Y_{i_2} ... Y_{i_n}⟩ an occurrence of X in Y. We consider the following
user-specified time constraints while looking for occurrences of a given sequence:
minimal and maximal gap allowed between consecutive elements of an occurrence of the
sequence (called min-gap and max-gap), and time window that allows a group of
consecutive elements of a data-sequence to be merged and treated as a single element as
long as their timestamps are within the user-specified window-size.

The support of a sequence $\langle X_1 X_2 \ldots X_n \rangle$ in $D$ is the
fraction of data-sequences in $D$ that contain the sequence. A sequential
pattern is a sequence whose support in $D$ is above the user-specified threshold.
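
The definitions of containment (without time constraints) and support can be
illustrated with a minimal Python sketch. The representation of a sequence as a
list of frozensets and the helper names `contains`, `support`, and `seq` are
assumptions made for this illustration only; they do not come from the paper.

```python
from typing import FrozenSet, List, Sequence

Itemset = FrozenSet[str]
Seq = Sequence[Itemset]


def contains(data_seq: Seq, pattern: Seq) -> bool:
    """Return True if `pattern` is contained in `data_seq` (no time constraints).

    Greedy left-to-right matching is sufficient here: matching each pattern
    element to the earliest possible data-sequence element never hurts.
    """
    pos = 0
    for element in pattern:
        # Advance to the first remaining element that is a superset of `element`.
        while pos < len(data_seq) and not element <= data_seq[pos]:
            pos += 1
        if pos == len(data_seq):
            return False
        pos += 1  # the next pattern element must match a strictly later element
    return True


def support(database: List[Seq], pattern: Seq) -> float:
    """Fraction of data-sequences in `database` that contain `pattern`."""
    if not database:
        return 0.0
    return sum(contains(s, pattern) for s in database) / len(database)


def seq(*elements: str) -> List[Itemset]:
    """Hypothetical helper: seq("ab", "c") builds the sequence <{a,b} {c}>."""
    return [frozenset(e) for e in elements]


D = [seq("a", "bc", "d"), seq("ab", "d"), seq("c", "a")]
print(support(D, seq("a", "d")))  # 2 of the 3 data-sequences contain <{a} {d}>
```

With a minimum support threshold of 0.5, the sequence <{a} {d}> would be a
sequential pattern in this toy dataset.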

1.2 Relationships between Results of Data Mining Queries 

Two data mining queries are equivalent if for all datasets they both return the
same set of patterns and the values of statistical significance measures (e.g.
support) for each pattern are the same in both cases. A data mining query
$DMQ_1$ includes a data mining query $DMQ_2$ if for all datasets each pattern in
the results of $DMQ_2$ is also returned by $DMQ_1$ with the same values of the
statistical significance measures. A data mining query $DMQ_1$ dominates a data
mining query $DMQ_2$ if for all datasets each pattern in the results of $DMQ_2$ is
also returned by $DMQ_1$, and for each pattern returned by both queries its values
of the statistical significance measures evaluated by $DMQ_1$ are not less than in
the case of $DMQ_2$. Equivalence is a particular case of inclusion, and inclusion
is a particular case of dominance. Equivalence, inclusion, and dominance are
transitive.

If for a given query, results of a query equivalent to it, including it, or dom- 
inating it are available, the query can be answered without running a costly 
mining algorithm. In case of equivalence no processing is necessary, since the 
queries have the same results. In case of inclusion, one scan of the materialized 
query results is necessary to filter out patterns that do not satisfy constraints 
of the included query. In case of dominance, one verifying scan of the source 
dataset is necessary to evaluate the statistical significance of materialized pat- 
terns (filtering out the patterns that do not satisfy constraints of the dominated 
query is also required). 

1.3 Related Work 

To facilitate interactive and iterative pattern discovery, [8] proposed to
materialize patterns discovered with the least restrictive selection criteria, and
answer incoming queries by filtering the materialized pattern collection. This
approach is not a perfect solution to the problem since pattern mining with very
low minimum support thresholds might lead to collections of frequent patterns
even larger than the original database. Moreover, restricting certain constraints
(e.g. time constraints in the context of sequential pattern mining) not only makes
some patterns infrequent but also changes the support of patterns that remain
frequent.

Much more reasonable and flexible solutions supporting interactive and
iterative mining were presented in [7], in the context of association rules. The
solutions presented there consisted in caching results of mining queries. In that
approach, materialization of frequent itemsets instead of rules was proposed.
However, in some cases it was also necessary to materialize some of the
infrequent itemsets.

Most of the research on sequential patterns focused on introducing new al- 
gorithms, more efficient than GSP (e.g. [5] [9]). However, the novel methods do 
not handle time constraints and taxonomies. Thus, GSP still remains the most 
general sequential pattern discovery algorithm and the reference point for new 
methods and techniques. 



1.4 Organization of the Paper 

The paper is organized as follows. Section 2 presents constraints that can be 
specified in sequential pattern mining. In Sect. 3, relationships between sequen- 
tial pattern queries are discussed. Section 4 contains efficient sequential pattern 
query processing algorithms. Experimental results concerning the proposed al- 
gorithms are presented in Sect. 5. We conclude with a summary in Sect. 6. 

2 Constraint-Based Sequential Pattern Mining 

In constraint-based sequential pattern mining, we identify the following classes 
of constraints: database constraints, pattern constraints, and time constraints. 
Database constraints are used to specify the source dataset. Pattern constraints 
specify which patterns are interesting and should be returned by the query. 
Finally, time constraints influence the process of checking whether a given
data-sequence contains a given pattern.

The basic formulation of the sequential pattern discovery problem introduces
three time constraints: max-gap, min-gap, and time window, and assumes only
one pattern constraint (expressed by means of the minimum support threshold).
We model pattern constraints as complex Boolean predicates having the form of
a conjunction of basic Boolean predicates on patterns presented below:

- $\pi(SPL, \alpha, pattern)$ - true if pattern support is less than $\alpha$, false otherwise;

- $\pi(SPG, \alpha, pattern)$ - true if pattern support is greater than $\alpha$, false otherwise;

- $\pi(SL, \alpha, pattern)$ - true if pattern size is less than $\alpha$, false otherwise;

- $\pi(SG, \alpha, pattern)$ - true if pattern size is greater than $\alpha$, false otherwise;

- $\pi(LL, \alpha, pattern)$ - true if pattern length is less than $\alpha$, false otherwise;

- $\pi(LG, \alpha, pattern)$ - true if pattern length is greater than $\alpha$, false otherwise;

- $\pi(C, \beta, pattern)$ - true if $\beta$ is a subsequence of the pattern, false otherwise;

- $\pi(NC, \beta, pattern)$ - true if $\beta$ is not a subsequence of the pattern, false
otherwise.

We believe that the above list of predicates is sufficient to allow users to express 
their pattern selection criteria. For simplicity’s sake, in length and size predicates 
we consider only sharp inequalities. 
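
The basic predicates listed above can be sketched as a small Python evaluator.
This is only an illustration under stated assumptions: patterns are lists of
frozensets, the pattern's support is supplied by the caller, and the names
`pi`, `satisfies`, `size`, `length`, and `is_subsequence` are hypothetical.

```python
from typing import FrozenSet, Iterable, Sequence, Tuple

Itemset = FrozenSet[str]
Seq = Sequence[Itemset]


def size(p: Seq) -> int:
    """Number of items in the sequence."""
    return sum(len(element) for element in p)


def length(p: Seq) -> int:
    """Number of elements (itemsets) in the sequence."""
    return len(p)


def is_subsequence(sub: Seq, p: Seq) -> bool:
    """True if `sub` is contained in `p` (greedy left-to-right matching)."""
    pos = 0
    for element in sub:
        while pos < len(p) and not element <= p[pos]:
            pos += 1
        if pos == len(p):
            return False
        pos += 1
    return True


def pi(kind: str, arg, pattern: Seq, support: float) -> bool:
    """Evaluate one basic predicate; all inequalities are strict."""
    checks = {
        "SPL": lambda: support < arg,
        "SPG": lambda: support > arg,
        "SL": lambda: size(pattern) < arg,
        "SG": lambda: size(pattern) > arg,
        "LL": lambda: length(pattern) < arg,
        "LG": lambda: length(pattern) > arg,
        "C": lambda: is_subsequence(arg, pattern),
        "NC": lambda: not is_subsequence(arg, pattern),
    }
    return checks[kind]()


def satisfies(constraints: Iterable[Tuple[str, object]], pattern: Seq, support: float) -> bool:
    """Pattern constraints are a conjunction of basic predicates."""
    return all(pi(kind, arg, pattern, support) for kind, arg in constraints)


p = [frozenset("ab"), frozenset("c")]  # <{a,b} {c}>: size 3, length 2
print(satisfies([("SPG", 0.2), ("SG", 2), ("C", [frozenset("a")])], p, support=0.3))  # True
```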




3 Relationships between Sequential Pattern Queries 

Inclusion and dominance relationships between two data mining queries are de- 
fined for queries operating on the same dataset. Therefore, analyzing differences 
between sequential pattern queries, we consider only differences in time and 
pattern constraints. 

Definition 1. Given two basic Boolean pattern predicates $b_1$ and $b_2$, we say that
$b_2$ is stronger than $b_1$ if one of the following conditions holds:

1. $b_1 = \pi(SPG, \alpha_1, pattern)$ and $b_2 = \pi(SPG, \alpha_2, pattern)$, where $\alpha_2 > \alpha_1$,

2. $b_1 = \pi(SPL, \alpha_1, pattern)$ and $b_2 = \pi(SPL, \alpha_2, pattern)$, where $\alpha_2 < \alpha_1$,

3. $b_1 = \pi(SG, \alpha_1, pattern)$ and $b_2 = \pi(SG, \alpha_2, pattern)$, where $\alpha_2 > \alpha_1$,

4. $b_1 = \pi(SL, \alpha_1, pattern)$ and $b_2 = \pi(SL, \alpha_2, pattern)$, where $\alpha_2 < \alpha_1$,

5. $b_1 = \pi(LG, \alpha_1, pattern)$ and $b_2 = \pi(LG, \alpha_2, pattern)$, where $\alpha_2 > \alpha_1$,

6. $b_1 = \pi(LL, \alpha_1, pattern)$ and $b_2 = \pi(LL, \alpha_2, pattern)$, where $\alpha_2 < \alpha_1$,

7. $b_1 = \pi(C, \beta_1, pattern)$ and $b_2 = \pi(C, \beta_2, pattern)$, where the pattern $\beta_1$ is a
subsequence of the pattern $\beta_2$ and the size of $\beta_1$ is less than the size of $\beta_2$,

8. $b_1 = \pi(NC, \beta_1, pattern)$ and $b_2 = \pi(NC, \beta_2, pattern)$, where the pattern $\beta_2$ is
a subsequence of the pattern $\beta_1$ and the size of $\beta_2$ is less than the size of $\beta_1$.
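
The eight conditions of Definition 1 can be checked mechanically. The sketch
below is one possible encoding, assuming a predicate is a `(kind, argument)`
tuple and a pattern is a list of frozensets; the function name `is_stronger`
and the helpers are hypothetical.

```python
from typing import FrozenSet, Sequence, Tuple

Seq = Sequence[FrozenSet[str]]
Predicate = Tuple[str, object]  # (kind, argument), e.g. ("SPG", 0.2)


def size(p: Seq) -> int:
    return sum(len(element) for element in p)


def is_subsequence(sub: Seq, p: Seq) -> bool:
    pos = 0
    for element in sub:
        while pos < len(p) and not element <= p[pos]:
            pos += 1
        if pos == len(p):
            return False
        pos += 1
    return True


def is_stronger(b2: Predicate, b1: Predicate) -> bool:
    """True if b2 is stronger than b1 in the sense of Definition 1."""
    (kind1, arg1), (kind2, arg2) = b1, b2
    if kind1 != kind2:
        return False
    if kind1 in ("SPG", "SG", "LG"):      # conditions 1, 3, 5: larger threshold
        return arg2 > arg1
    if kind1 in ("SPL", "SL", "LL"):      # conditions 2, 4, 6: smaller threshold
        return arg2 < arg1
    if kind1 == "C":                      # condition 7
        return is_subsequence(arg1, arg2) and size(arg1) < size(arg2)
    if kind1 == "NC":                     # condition 8
        return is_subsequence(arg2, arg1) and size(arg2) < size(arg1)
    return False


a = [frozenset("a")]                      # <{a}>
ab = [frozenset("a"), frozenset("b")]     # <{a} {b}>
print(is_stronger(("SPG", 0.2), ("SPG", 0.1)))  # True
print(is_stronger(("C", ab), ("C", a)))         # True
print(is_stronger(("NC", a), ("NC", ab)))       # True
```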

Definition 2. We say that a data mining query $DMQ_2$ extends pattern
constraints of a data mining query $DMQ_1$ if any of the following conditions holds:

1. Pattern constraints of $DMQ_1$ have a form of a conjunction of $n$ basic
Boolean pattern predicates, pattern constraints of $DMQ_2$ have a form of
a conjunction of $n + 1$ basic Boolean pattern predicates ($n > 0$), and each
basic Boolean pattern predicate in $DMQ_1$ also appears in $DMQ_2$;

2. $DMQ_1$ and $DMQ_2$ have pattern constraints $p_1$ and $p_2$ respectively, where
$p_1$ and $p_2$ are conjunctions of $n$ basic Boolean pattern predicates ($n > 1$),
$p_1 = p \wedge b_1$, $p_2 = p \wedge b_2$ ($p$ is a conjunction of $n - 1$ basic Boolean pattern
predicates), and $b_2$ is stronger than $b_1$;

3. It is possible to formulate a data mining query $DMQ_3$ such that $DMQ_2$
extends pattern constraints of $DMQ_3$ and $DMQ_3$ extends pattern constraints
of $DMQ_1$. (The relationship of extending pattern constraints is transitive.)

In other words, a data mining query $DMQ_2$ extends pattern constraints of a
data mining query $DMQ_1$ if pattern constraints of $DMQ_1$ can be transformed
into pattern constraints of $DMQ_2$ by appending new basic Boolean pattern
predicates or replacing basic Boolean pattern predicates with stronger ones.

Given two sequential pattern queries, there are four cases possible regarding
pattern constraints: $DMQ_1$ and $DMQ_2$ have the same pattern constraints,
$DMQ_1$ extends pattern constraints of $DMQ_2$, $DMQ_2$ extends pattern
constraints of $DMQ_1$, or pattern constraints of $DMQ_1$ and $DMQ_2$ are not
comparable.
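
The four cases can be told apart with a short comparison routine. The sketch
below is deliberately simplified and rests on two assumptions that are not in
the paper: only the six threshold predicates are handled (C and NC are left
out), and each predicate kind appears at most once per query, so a conjunction
can be stored as a dictionary. All names are hypothetical.

```python
from typing import Dict

GREATER = {"SPG", "SG", "LG"}   # stronger means a larger threshold
LESS = {"SPL", "SL", "LL"}      # stronger means a smaller threshold


def equal_or_stronger(kind: str, a2: float, a1: float) -> bool:
    """True if (kind, a2) is the same predicate as (kind, a1) or a stronger one."""
    if kind not in GREATER | LESS:
        raise ValueError(f"unsupported predicate kind: {kind}")
    if a2 == a1:
        return True
    return a2 > a1 if kind in GREATER else a2 < a1


def extends_pattern_constraints(p2: Dict[str, float], p1: Dict[str, float]) -> bool:
    """True if p2 is obtained from p1 by appending or strengthening predicates."""
    if p2 == p1:
        return False  # identical constraints are "the same", not an extension
    return all(k in p2 and equal_or_stronger(k, p2[k], p1[k]) for k in p1)


def compare(p1: Dict[str, float], p2: Dict[str, float]) -> str:
    if p1 == p2:
        return "same"
    if extends_pattern_constraints(p1, p2):
        return "first extends second"
    if extends_pattern_constraints(p2, p1):
        return "second extends first"
    return "not comparable"


q1 = {"SPG": 0.2}
q2 = {"SPG": 0.1, "SG": 3}
q3 = {"SPG": 0.2, "SG": 3}
print(compare(q3, q1))  # first extends second
print(compare(q3, q2))  # first extends second
print(compare(q1, q2))  # not comparable
```

Handling C and NC predicates, or several predicates of the same kind, would
require matching each predicate of the weaker query to a distinct equal or
stronger predicate of the other query.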

Definition 3. We say that a data mining query $DMQ_2$ extends time
constraints of a data mining query $DMQ_1$ if any of the following conditions holds:

1. The value of the max-gap parameter in $DMQ_2$ is less than in $DMQ_1$ and
both queries have the same value of the min-gap parameter, and the same
value of the window-size parameter;

2. The value of the min-gap parameter in $DMQ_2$ is greater than in $DMQ_1$ and
both queries have the same value of the max-gap parameter, and the same
value of the window-size parameter;

3. The value of the window-size parameter in $DMQ_2$ is less than in $DMQ_1$
and both queries have the same value of the max-gap parameter, and the
same value of the min-gap parameter;

4. It is possible to formulate a data mining query $DMQ_3$ such that $DMQ_2$
extends time constraints of $DMQ_3$ and $DMQ_3$ extends time constraints of
$DMQ_1$. (The relationship of extending time constraints is transitive.)

In other words, a data mining query $DMQ_2$ extends time constraints of a data
mining query $DMQ_1$ if it restricts at least one of the time parameters (max-gap,
min-gap, window-size) and does not relax any time parameters.

Given two sequential pattern queries, there are four cases possible regarding
time constraints: $DMQ_1$ and $DMQ_2$ have the same time constraints, $DMQ_1$
extends time constraints of $DMQ_2$, $DMQ_2$ extends time constraints of $DMQ_1$,
or time constraints of $DMQ_1$ and $DMQ_2$ are not comparable.
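
The "in other words" characterization of Definition 3 translates directly into
a comparison of the three parameters. The following Python sketch assumes a
hypothetical `TimeConstraints` record; the sample values are illustrative.

```python
from typing import NamedTuple


class TimeConstraints(NamedTuple):
    max_gap: float
    min_gap: float
    window_size: float


def extends_time_constraints(t2: TimeConstraints, t1: TimeConstraints) -> bool:
    """True if t2 restricts at least one parameter of t1 and relaxes none."""
    no_relaxation = (
        t2.max_gap <= t1.max_gap
        and t2.min_gap >= t1.min_gap
        and t2.window_size <= t1.window_size
    )
    return no_relaxation and t2 != t1


loose = TimeConstraints(max_gap=100, min_gap=0, window_size=1)
tight = TimeConstraints(max_gap=100, min_gap=7, window_size=1)
other = TimeConstraints(max_gap=50, min_gap=0, window_size=1)

print(extends_time_constraints(tight, loose))  # True: min-gap was restricted
print(extends_time_constraints(loose, tight))  # False
# `tight` and `other` are not comparable: each one relaxes a parameter of the other.
print(extends_time_constraints(tight, other), extends_time_constraints(other, tight))
```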

Example 1. Let us consider the following three sequential pattern queries,
operating on the same dataset:

$DMQ_1$ = {max-gap: 100, min-gap: 0, window-size: 1, $\pi(SPG, 0.2, pattern)$}
$DMQ_2$ = {max-gap: 100, min-gap: 0, window-size: 1, $\pi(SPG, 0.1, pattern) \wedge \pi(SG, 3, pattern)$}
$DMQ_3$ = {max-gap: 100, min-gap: 7, window-size: 1, $\pi(SPG, 0.2, pattern) \wedge \pi(SG, 3, pattern)$}

$DMQ_3$ extends pattern constraints of $DMQ_1$ and $DMQ_2$, while pattern
constraints of $DMQ_1$ and $DMQ_2$ are not comparable. $DMQ_3$ extends time
constraints of $DMQ_1$ and $DMQ_2$, while time constraints of $DMQ_1$ and $DMQ_2$
are the same.

The two relationships defined above concern the syntax of queries, while the 
general inclusion and dominance relationships refer to results of queries. Below 
we introduce three theorems regarding dependence of relationships between re- 
sults of two queries on syntactic differences between the two queries. We also 
introduce several lemmas on which the proofs of theorems are based. For brevity, 
we do not include proofs of the lemmas since they come straight from the above 
definitions and inherent properties of pattern and time constraints. 

Lemma 1. Let $b_1$ and $b_2$ be basic Boolean pattern predicates such that $b_2$ is
stronger than $b_1$. For each pattern $p$, if $p$ satisfies $b_2$ then $p$ satisfies $b_1$.

Lemma 2. Let $DMQ_1$ and $DMQ_2$ be two sequential pattern queries, operating
on the same dataset and having the same time constraints. Let $p_1$ and $p_2$ denote
pattern constraints of $DMQ_1$ and $DMQ_2$ respectively. If $p_2 = p_1 \wedge b$, where $b$ is
a basic Boolean pattern predicate, then $DMQ_1$ includes $DMQ_2$.

Lemma 3. Let $DMQ_1$ and $DMQ_2$ be two sequential pattern queries, operating
on the same dataset and having the same time constraints. Let $p_1$ and $p_2$ denote
pattern constraints of $DMQ_1$ and $DMQ_2$ respectively. If $p_1 = p \wedge b_1$ and
$p_2 = p \wedge b_2$, where $p$ is a conjunction of $n$ basic Boolean pattern predicates ($n > 0$)
and $b_2$ is stronger than $b_1$, then $DMQ_1$ includes $DMQ_2$.

Theorem 1. Let $DMQ_1$ and $DMQ_2$ be two sequential pattern queries, operating
on the same dataset and having the same time constraints. If $DMQ_2$ extends
pattern constraints of $DMQ_1$, then $DMQ_1$ includes $DMQ_2$.

Proof. From Definition 2, we know that if $DMQ_2$ extends pattern constraints
of $DMQ_1$, then it is possible to formulate a sequence of sequential pattern queries
$DMQ_{i_1}, DMQ_{i_2}, \ldots, DMQ_{i_n}$ operating on the same dataset and having the
same time constraints as $DMQ_1$ and $DMQ_2$, such that $DMQ_{i_1} = DMQ_1$ and
$DMQ_{i_n} = DMQ_2$, and for $j = 2..n$ one of the following conditions holds:

1. pattern constraints of $DMQ_{i_{j-1}}$ have a form of a conjunction of $n$ basic
Boolean pattern predicates, pattern constraints of $DMQ_{i_j}$ have a form of
a conjunction of $n + 1$ basic Boolean pattern predicates ($n > 0$), and each
basic Boolean pattern predicate in $DMQ_{i_{j-1}}$ also appears in $DMQ_{i_j}$;

2. $DMQ_{i_{j-1}}$ and $DMQ_{i_j}$ have pattern constraints $p_1$ and $p_2$ respectively, where
$p_1$ and $p_2$ are conjunctions of $n$ basic Boolean pattern predicates ($n > 1$),
$p_1 = p \wedge b_1$, $p_2 = p \wedge b_2$ ($p$ is a conjunction of $n - 1$ basic Boolean pattern
predicates), and $b_2$ is stronger than $b_1$.

From Lemmas 2 and 3 and the transitivity property of the inclusion
relationship, we have $DMQ_1$ includes $DMQ_2$.



Lemma 4. Let $DMQ_1$ and $DMQ_2$ be two sequential pattern queries, operating
on the same dataset and having the same pattern constraints. Let $max_1$, $min_1$,
and $win_1$ denote values of max-gap, min-gap, and window-size parameters of
$DMQ_1$, and $max_2$, $min_2$, and $win_2$ values of max-gap, min-gap, and
window-size parameters of $DMQ_2$. If one of the following conditions holds:

1. $max_2 < max_1$ and $min_2 = min_1$ and $win_2 = win_1$,

2. $min_2 > min_1$ and $max_2 = max_1$ and $win_2 = win_1$,

3. $win_2 < win_1$ and $max_2 = max_1$ and $min_2 = min_1$

then $DMQ_1$ dominates $DMQ_2$.

Theorem 2. Let $DMQ_1$ and $DMQ_2$ be two sequential pattern queries,
operating on the same dataset and having the same pattern constraints. If $DMQ_2$
extends time constraints of $DMQ_1$, then $DMQ_1$ dominates $DMQ_2$.

Proof. Let $max_1$, $min_1$, and $win_1$ denote values of max-gap, min-gap, and
window-size parameters of $DMQ_1$, and $max_2$, $min_2$, and $win_2$ values of
max-gap, min-gap, and window-size parameters of $DMQ_2$. Since $DMQ_2$ extends time
constraints of $DMQ_1$, we have: $win_2 \le win_1$, $max_2 \le max_1$ and $min_2 \ge min_1$.
Let $DMQ_3$ and $DMQ_4$ be sequential pattern queries operating on the same
dataset and having the same pattern constraints as $DMQ_1$ and $DMQ_2$. Let the
values of max-gap, min-gap, and window-size parameters be $max_2$, $min_1$, and
$win_1$ in case of $DMQ_3$, and $max_2$, $min_2$, and $win_1$ in case of $DMQ_4$. Thus,
from Lemma 4, $DMQ_1$ dominates $DMQ_3$, $DMQ_3$ dominates $DMQ_4$, and
$DMQ_4$ dominates $DMQ_2$ (in fact, in each of the three cases equivalence is
possible but equivalence is a particular case of dominance). Since the dominance
relationship is transitive, $DMQ_1$ dominates $DMQ_2$.

Theorem 3. Let $DMQ_1$ and $DMQ_2$ be two sequential pattern queries, operating
on the same dataset. If $DMQ_2$ extends pattern constraints of $DMQ_1$ and
$DMQ_2$ extends time constraints of $DMQ_1$, then $DMQ_1$ dominates $DMQ_2$.

Proof. Let $DMQ_3$ be a sequential pattern query operating on the same dataset
as $DMQ_1$ and $DMQ_2$, having pattern constraints of $DMQ_1$ and time
constraints of $DMQ_2$. Thus, $DMQ_2$ extends pattern constraints of $DMQ_3$ and
$DMQ_3$ extends time constraints of $DMQ_1$. From Theorems 1 and 2 we
have: $DMQ_1$ dominates $DMQ_3$ and $DMQ_3$ includes $DMQ_2$. Since inclusion
is a particular case of dominance and the dominance relationship is transitive,
$DMQ_1$ dominates $DMQ_2$.

4 Algorithms for Efficient Sequential Pattern Query 
Processing in the Presence of Materialized Results of 
Previous Queries 

Given a sequential pattern query DMQ and materialized results of a sequential 
pattern query DMQy, in the general case, even if DMQy and DMQ operate on 
the same dataset but differ in pattern and time constraints, it is not possible to 
answer DMQ without running a sequential pattern mining algorithm. However, 
there are four particular cases where DMQ can be answered efficiently using the 
materialized results of DMQy since they correspond to equivalence, inclusion, 
and dominance relationships between DMQy and DMQ. These cases are listed 
below: 

1. If DMQy and DMQ have the same pattern and time constraints, then the
results of DMQ are equal to the results of DMQy (the two queries are
equivalent since they are identical);

2. If DMQy and DMQ have the same time constraints and DMQ extends
pattern constraints of DMQy, then DMQ can be answered by filtering out
the patterns returned by DMQy not satisfying pattern constraints of DMQ
(DMQy includes DMQ according to Theorem 1);

3. If DMQy and DMQ have the same pattern constraints and DMQ extends
time constraints of DMQy, then DMQ can be answered by evaluating the
support of the patterns returned by DMQy using the time constraints of
DMQ, and filtering out patterns not satisfying the minimum support
threshold of DMQ (DMQy dominates DMQ according to Theorem 2);

4. If DMQ extends pattern constraints of DMQy and DMQ extends time
constraints of DMQy, then DMQ can be answered by evaluating the support
of the patterns returned by DMQy using the time constraints of DMQ,
and filtering out patterns not satisfying the pattern constraints of DMQ
(DMQy dominates DMQ according to Theorem 3).

Answering the query in the first case (the case of equivalence) is trivial, therefore 
we concentrate on details concerning inclusion and dominance relationships. 
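
The four cases above amount to a small decision table. As a hedged sketch,
assume the relation of DMQ to DMQy has already been classified, separately for
pattern and time constraints, as "same", "extends" (DMQ extends DMQy), or
"other"; the function name and the returned labels are hypothetical.

```python
def choose_strategy(pattern_rel: str, time_rel: str) -> str:
    """Map the syntactic relation of DMQ to DMQy onto a processing strategy."""
    if pattern_rel == "same" and time_rel == "same":
        return "equivalence: reuse materialized results"
    if pattern_rel == "extends" and time_rel == "same":
        return "inclusion: filter materialized results"
    if time_rel == "extends" and pattern_rel in ("same", "extends"):
        return "dominance: re-evaluate support in one dataset scan"
    return "not applicable: run a complete mining algorithm"


for rels in [("same", "same"), ("extends", "same"), ("same", "extends"),
             ("extends", "extends"), ("other", "same")]:
    print(rels, "->", choose_strategy(*rels))
```

Any combination in which DMQ relaxes a constraint of DMQy, or in which the
constraints are not comparable, falls through to the last branch.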

For the second case we propose an algorithm that performs one sequential
scan of the materialized patterns, processing one pattern at a time (main
memory requirements are minimal). Each pattern is tested against those basic
Boolean pattern predicates from the pattern constraints of DMQ that were
not in DMQy. All the basic Boolean pattern predicates of DMQ that were
in DMQy are necessarily satisfied by all the materialized patterns, since
pattern constraints in our model have the form of a conjunction of basic
predicates. The algorithm for the second case is presented below.

Algorithm 1 Answering a sequential pattern query in case of inclusion due to
extending pattern constraints (Result Filtering)

Input: A sequential pattern query issued by a user (DMQ) and results of a
sequential pattern query DMQy including DMQ.

Output: The results of DMQ.

Method:

begin
    Answer = results of DMQy;
    for each p ∈ results of DMQy do
    begin
        for each basic Boolean pattern predicate b such that
                b is in pattern constraints of DMQ and
                b is not in pattern constraints of DMQy do
        begin
            if not (p satisfies b) then
                Answer = Answer \ {p};
                break;
            end if;
        end;
    end;
    output Answer;
end.
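
A compact Python rendering of Algorithm 1 is sketched below. It is not the
author's implementation: it assumes that each materialized entry stores a
pattern together with its support, and that the predicates of DMQ missing from
DMQy are passed in as callables. The names `filter_results`, `cached`, and
`new_predicates` are hypothetical.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple

Pattern = Tuple[FrozenSet[str], ...]
Entry = Tuple[Pattern, float]                     # (pattern, materialized support)
Predicate = Callable[[Pattern, float], bool]


def filter_results(materialized: Sequence[Entry],
                   new_predicates: Sequence[Predicate]) -> List[Entry]:
    """One sequential scan; only predicates absent from DMQy are tested."""
    answer = []
    for pattern, support in materialized:
        # all() stops at the first failing predicate, like the `break` above.
        if all(b(pattern, support) for b in new_predicates):
            answer.append((pattern, support))
    return answer


def size(p: Pattern) -> int:
    return sum(len(element) for element in p)


cached = [
    ((frozenset("a"), frozenset("b")), 0.30),     # size 2
    ((frozenset("ab"), frozenset("cd")), 0.15),   # size 4, low support
    ((frozenset("abc"), frozenset("d")), 0.25),   # size 4
]
# Hypothetical new predicates: support > 0.2 and size > 3.
new_predicates = [lambda p, s: s > 0.2, lambda p, s: size(p) > 3]
result = filter_results(cached, new_predicates)
print(result)  # only the third entry remains
```

Building the answer incrementally is equivalent to starting from the full
result set and removing failing patterns, and it keeps the scan streaming.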

For the third and fourth cases we propose one uniform algorithm (both cases
result in the dominance relationship). Conceptually, the algorithm has to scan
the source dataset once in order to re-evaluate the support of materialized
patterns and then prune the patterns that do not satisfy pattern constraints of
DMQ. However, for the fourth case, we apply one optimization to reduce the cost
of the support re-evaluation phase, which is proportional to the number of
patterns to be verified. Before scanning the source dataset, we filter out
patterns that do not satisfy pattern constraints of DMQ using Algorithm 1.
After the scan of the dataset, we only test the predicate representing the
minimum support threshold (the only one that for a given pattern could be true
before the support re-evaluation, and false after that operation). The effects
of this optimization will be discussed in the next section.

During the support re-evaluation phase, when testing whether a currently 
processed data-sequence contains a given pattern, all time constraints of DMQ 
have to be taken into account, even if only one of them has been restricted 
compared to DMQy. This is motivated by the observation that a given pattern 
may occur several times in a given data-sequence. As a result, if we checked 
only one of the time constraints, we might find a different occurrence satisfying 
the constraint than the occurrence previously found as valid with respect to the 
other two time constraints. 
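
The need to test all time constraints on the same occurrence can be shown with
a simplified containment check. The sketch below ignores the time window (each
element is matched on its own) and assumes the common GSP-style reading of the
gaps: the time difference between consecutive matched elements must be greater
than min-gap and at most max-gap. The function name and data layout are
hypothetical.

```python
from typing import FrozenSet, List, Optional, Sequence, Tuple

DataSeq = List[Tuple[float, FrozenSet[str]]]      # (timestamp, itemset), time-ordered
Pattern = Sequence[FrozenSet[str]]


def contains_timed(data_seq: DataSeq, pattern: Pattern,
                   min_gap: float, max_gap: float) -> bool:
    """Backtracking search for one occurrence satisfying both gap constraints."""

    def search(k: int, prev_time: Optional[float], start: int) -> bool:
        if k == len(pattern):
            return True
        for j in range(start, len(data_seq)):
            t, items = data_seq[j]
            if prev_time is not None:
                gap = t - prev_time
                if gap <= min_gap:
                    continue          # too close to the previous element
                if gap > max_gap:
                    break             # later elements are even further away
            if pattern[k] <= items and search(k + 1, t, j + 1):
                return True
        return False

    return search(0, None, 0)


ds = [(0, frozenset("a")), (1, frozenset("b")), (50, frozenset("b"))]
p = [frozenset("a"), frozenset("b")]
inf = float("inf")

print(contains_timed(ds, p, min_gap=5, max_gap=inf))  # True, via the b at time 50
print(contains_timed(ds, p, min_gap=0, max_gap=20))   # True, via the b at time 1
print(contains_timed(ds, p, min_gap=5, max_gap=20))   # False: no single occurrence fits both
```

Each constraint taken alone is satisfied by some occurrence, but by different
ones, which is why the re-evaluation phase must apply all constraints jointly.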

The algorithm in the form presented below assumes that the set of materialized
patterns satisfying pattern constraints of DMQ fits into main memory.
If this is not the case, the set of materialized patterns has to be partitioned into
portions that fit into main memory and the algorithm has to be run on each of
the partitions.

Algorithm 2 Answering a sequential pattern query in case of dominance due
to extending time constraints (Result Verification)

Input: A sequential pattern query issued by a user (DMQ), a collection of
data-sequences D, and results of a sequential pattern query DMQy dominating
DMQ.

Output: The results of DMQ.

Method:

begin
    if DMQ extends pattern constraints of DMQy then
        Answer = patterns in results of DMQy satisfying
                 pattern constraints of DMQ; /* Algorithm 1 */
    else
        Answer = results of DMQy;
    end if;
    scan D once evaluating the support of patterns
        in Answer using time constraints of DMQ;
    for each p ∈ Answer do
    begin
        if p exceeds the minimum support threshold of DMQ
            then output p; end if;
    end;
end.
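
Algorithm 2 can be sketched in Python as follows. This is an illustration
under stated assumptions, not the author's code: the containment test that
applies the time constraints of DMQ is supplied by the caller, and the
`make_contains` helper used in the example handles only min-gap and max-gap
(no time window). All names are hypothetical.

```python
from typing import Callable, Dict, FrozenSet, List, Sequence, Tuple

Pattern = Tuple[FrozenSet[str], ...]
DataSeq = List[Tuple[float, FrozenSet[str]]]      # (timestamp, itemset), time-ordered


def make_contains(min_gap: float, max_gap: float) -> Callable[[DataSeq, Pattern], bool]:
    """Hypothetical containment test using only min-gap and max-gap."""
    def contains(ds, p, k=0, prev=None, start=0):
        if k == len(p):
            return True
        return any(
            p[k] <= items
            and (prev is None or min_gap < t - prev <= max_gap)
            and contains(ds, p, k + 1, t, j + 1)
            for j, (t, items) in enumerate(ds[start:], start)
        )
    return contains


def verify_results(candidates: Sequence[Pattern], database: Sequence[DataSeq],
                   contains: Callable[[DataSeq, Pattern], bool], min_support: float,
                   extra_predicates: Sequence[Callable[[Pattern], bool]] = ()) -> Dict[Pattern, float]:
    # Filter on pattern constraints first, so fewer patterns are counted in the scan.
    answer = [p for p in candidates if all(b(p) for b in extra_predicates)]
    counts = {p: 0 for p in answer}
    for data_seq in database:                     # single scan of D
        for p in answer:
            if contains(data_seq, p):
                counts[p] += 1
    if not database:
        return {}
    n = len(database)
    return {p: c / n for p, c in counts.items() if c / n > min_support}


fs = frozenset
D = [
    [(0, fs("a")), (10, fs("b"))],
    [(0, fs("a")), (3, fs("b"))],
    [(0, fs("a")), (8, fs("b")), (9, fs("c"))],
    [(0, fs("c"))],
]
ab, ac = (fs("a"), fs("b")), (fs("a"), fs("c"))
result = verify_results([ab, ac], D, make_contains(min_gap=7, max_gap=100), min_support=0.3)
print(result)  # <{a} {b}> keeps support 0.5; <{a} {c}> drops to 0.25 and is pruned
```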

Having provided sequential pattern query processing algorithms for the cases
leading to equivalence, inclusion and dominance relationships, we have to
address situations where for a given query issued by a user (DMQ), there are
many materialized query results that could be used to answer the query without
running a complete data mining algorithm. In general, the set of applicable
materialized query results consists of results of queries equivalent to DMQ,
including DMQ, and dominating DMQ. It is clear that in the first place the data
mining system should look for a query identical to DMQ (the case of equivalence)
since in that case the results of DMQ are directly available. Then, the system
should look for query results that could be used by Algorithm 1 (returned by a
query DMQy having the same time constraints as DMQ, such that DMQ extends
pattern constraints of DMQy). If no query satisfying the above criteria could
be found, the system should try to find query results that could be used by
Algorithm 2 (returned by a query DMQy, such that DMQ extends time constraints
of DMQy and either DMQy and DMQ have the same pattern constraints or
DMQ extends pattern constraints of DMQy). Finally, if again no appropriate
query could be found, a complete data mining algorithm has to be run.

We believe that in the majority of cases Algorithm 1 will be more efficient than
Algorithm 2 since the former requires one scan of the pattern set and no scan
of the source dataset, while the latter scans the source dataset once and during
this scan for each data-sequence processes all the patterns. However, it has to
be noted that in certain cases application of Algorithm 2 may be more efficient
than application of Algorithm 1 (for example, if the source dataset and the
materialized set of patterns to be used by Algorithm 2 are extremely small,
whereas the materialized pattern set to be used by Algorithm 1 is huge).

The final issue that has to be addressed is the selection of the materialized 
query results to be used by Algorithms 1 and 2 if there is more than one query 
including or dominating the query to be answered. We observe that it is not 
possible to provide selection criteria always leading to the minimal processing 
time, because the processing time depends not only on the syntax of the queries 
but also on the contents of the source dataset. Therefore, we decide to optimize 
the space requirements by choosing the materialized pattern set of the smallest 
size. We believe that this solution will also lead to minimal processing time in 
many situations, since smaller size of the pattern set leads to the smaller number 
or size of patterns that have to be filtered or verified against the database. It 
is not guaranteed, however, since the processing time is affected also by the 
number of predicates that have to be tested for each pattern, which depends 
on the pattern structure (subsequent predicates are tested until one of them is 
found to be false). 
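
The lookup order and the smallest-set rule can be combined into one selection
step. The sketch below assumes that each cached result has already been
labelled with the strategy it would allow for the current query; the record
type, the labels, and the pattern counts are hypothetical.

```python
from typing import List, NamedTuple, Optional


class CachedResult(NamedTuple):
    name: str
    strategy: str        # "equivalence", "filtering", or "verification"
    pattern_count: int   # size of the materialized pattern set


# Lower value means the option is preferred.
PRIORITY = {"equivalence": 0, "filtering": 1, "verification": 2}


def pick(candidates: List[CachedResult]) -> Optional[CachedResult]:
    """Prefer equivalence, then Algorithm 1, then Algorithm 2; break ties by size."""
    usable = [c for c in candidates if c.strategy in PRIORITY]
    if not usable:
        return None      # fall back to a complete mining algorithm
    return min(usable, key=lambda c: (PRIORITY[c.strategy], c.pattern_count))


cache = [
    CachedResult("Q_a", "verification", 500),
    CachedResult("Q_b", "filtering", 3500),
    CachedResult("Q_c", "filtering", 1200),
]
print(pick(cache))  # Q_c: usable by Algorithm 1 and smaller than Q_b
```

As noted in the text, the smallest pattern set is a heuristic choice and does
not guarantee the shortest processing time.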

Example 2. Let us consider the following three queries discovering sequential
patterns from the same dataset:

$DMQ_1$ = {max-gap: 100, min-gap: 0, window-size: 1, $\pi(SPG, 0.2, pattern)$}
$DMQ_2$ = {max-gap: 100, min-gap: 7, window-size: 1, $\pi(SPG, 0.1, pattern) \wedge \pi(SG, 3, pattern)$}
DMQ = {max-gap: 100, min-gap: 7, window-size: 1, $\pi(SPG, 0.2, pattern) \wedge \pi(SG, 3, pattern)$}

Let us assume that results of $DMQ_1$ and $DMQ_2$ are stored in cache, and DMQ
is the query to be answered. Since neither $DMQ_1$ nor $DMQ_2$ is identical to
DMQ, the data mining system would choose to answer DMQ using Algorithm
1 exploiting cached results of $DMQ_2$ (returning those patterns from the results of
$DMQ_2$ that exceed the minimum support threshold of 0.2). If results of $DMQ_2$
were not available, the system would answer DMQ using Algorithm 2 exploiting
the results of $DMQ_1$ (selecting patterns from the results of $DMQ_1$ whose size is
greater than 3, re-evaluating their support in one scan of the source dataset using
max-gap of 100, min-gap of 7, and window-size of 1, and returning those patterns
that exceed the minimum support threshold of 0.2 after support re-evaluation).

5 Experimental Results 

In order to evaluate performance gains offered by our sequential pattern query 
processing algorithms, we performed several experiments on a synthetic dataset 
generated by means of the GEN generator from the Quest project [1]. We treated
transaction identifiers generated by GEN as transaction times. Thus, the time
gap between two adjacent elements of each data-sequence was always equal to
one time unit. The dataset used in the experiments consisted of 1000
data-sequences. GEN parameter values were chosen so that for the minimum support
thresholds used in queries there were a reasonable number of sequential patterns 
varying in size and length to be discovered. 

In the first step we materialized the results of the query discovering all se- 
quential patterns whose support was above 0.5% using max-gap of 1000, min- 
gap of 0, and window-size of 1. The materialized set of patterns consisted of 
about 3500 sequential patterns. Next, we tested several queries adding addi- 
tional pattern constraints (concerning pattern support, size, length, or contents) 
and restricting time constraints. For each query, we compared execution times of 
our algorithms exploiting materialized patterns and the GSP algorithm with the 
post-processing pattern filtering phase. For the queries included by the materi- 
alized query, Algorithm 1 was on average more than 400 times faster than GSP. 
For the queries dominated by the materialized query, Algorithm 2 was used, and 
its processing time was on average more than 100 times shorter than in case 
of GSP. We also tested the effects of our optimization used in case of queries 
extending both pattern and time constraints of the materialized query (filtering 
out patterns that do not satisfy pattern constraints before re-evaluating the sup- 
port of materialized patterns). Experiments show that the optimization reduces 
processing time by about 33%. 

6 Concluding Remarks 

Our experiments show that the proposed sequential pattern query processing
schemes can reduce processing time by more than two orders of magnitude when
materialized results of previous queries are available. However, it is
theoretically possible to imagine situations where a complete mining algorithm
could be more efficient than our techniques. While we believe that in typical
situations our methods should outperform mining algorithms, in the future we
plan to focus on cost-based optimization of sequential pattern queries (and data
mining queries in general) using certain statistics of the source dataset in
order to choose an optimal query execution plan.

In the paper, we did not discuss cache management schemes, which could 
certainly influence the overall performance of the system. We believe that gen- 
eral purpose cache management algorithms could be used, possibly with simple 
optimizations such as removing included or dominated queries first, and not ma- 
terializing results of queries equivalent to queries whose results are already in 
cache. 

References 

1. Agrawal R., Mehta M., Shafer J., Srikant R., Arning A., Bollinger T.: The Quest 
Data Mining System. Proc. of the 2nd KDD Conference (1996) 

2. Agrawal R., Srikant R.: Mining Sequential Patterns. Proc. of the 11th ICDE Conf. 
(1995) 

3. Baralis E., Psaila G.: Incremental Refinement of Mining Queries. Proc. of the 1st 
DaWaK Conference (1999) 

4. Han J., Lakshmanan L., Ng R.: Constraint-Based Multidimensional Data Mining. 
IEEE Computer, Vol. 32, No. 8 (1999) 

5. Han J., Pei J., Mortazavi-Asl B., Chen Q., Dayal U., Hsu M-C.: FreeSpan: Frequent
Pattern-Projected Sequential Pattern Mining. Proc. of the 6th KDD Conference
(2000)

6. Imielinski T., Mannila H.: A Database Perspective on Knowledge Discovery. Com- 
munications of the ACM, Vol. 39, No. 11 (1996) 

7. Nag B., Deshpande P.M., DeWitt D.J.: Using a Knowledge Cache for Interactive 
Discovery of Association Rules. Proc. of the 5th KDD Conference (1999) 

8. Parthasarathy S., Zaki M.J., Ogihara M., Dwarkadas S.: Incremental and Interac- 
tive Sequence Mining. Proc. of the 8th CIKM Conference (1999) 

9. Pei J., Han J., Mortazavi-Asl B., Pinto H., Chen Q., Dayal U., Hsu M-C.:
PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern
Growth. Proc. of the 17th ICDE Conference (2001)

10. Srikant R., Agrawal R.: Mining Sequential Patterns: Generalizations and Perfor- 
mance Improvements. Proc. of the 5th EDBT Conference (1996) 




Evaluation of a Broadcast Scheduling Algorithm 



Murat Karakaya 1 and Özgür Ulusoy 2

1 Department of Technical Sciences
Turkish Land Forces Academy, Ankara 06100, Turkey
2 Department of Computer Engineering
Bilkent University, Ankara 06533, Turkey
muratk@kho.edu.tr, oulusoy@cs.bilkent.edu.tr



Abstract. One of the two main approaches to data broadcasting is pull-based
data delivery. In this paper, we focus on the problem of scheduling
data items to broadcast in such a pull-based environment. Previous work
has shown that the Longest Wait First heuristic has the best performance
results compared to all other broadcast scheduling algorithms; however,
its decision overhead prevents its practical implementation. Observing this
fact, we propose an efficient broadcast scheduling algorithm which is
based on an approximate version of the Longest Wait First heuristic. We
also compare the performance of the proposed algorithm against well-known
broadcast scheduling algorithms.



1 Introduction 

There exist two main approaches for data dissemination in broadcast systems: 
push and pull [1,2,12,19]. In push-based data delivery, the information server 
tries to predict data needs using the knowledge provided by user profiles or 
subscriptions. The server constructs a broadcast schedule in which initiation of 
data transmission does not require an explicit request from mobile users. The 
server repetitively transmits the content of the broadcast schedule to the user 
population. Mobile users monitor the broadcast channel and retrieve the items 
they require as they arrive. On the other hand, in a pull-based environment, 
clients explicitly request data items by sending a message to the server. The 
requests are compiled in a service queue, and a scheduling algorithm decides 
which data item should be broadcast next. 

The main contributions of our work can be described as follows. First, 
previous work has shown that the Longest Wait First (LWF) heuristic has the 
best performance results compared to all other broadcast scheduling algorithms; 
however, its decision overhead prevents its practical implementation [6,7,19,5]. 
We propose to use an approximate version of the LWF heuristic which 
considerably reduces the decision overhead of LWF. Second, the implementation of 
the approximate heuristic is carefully designed and also parameterized to 
increase the performance with respect to different criteria. Third, some 
heuristics proposed for push-based broadcast environments are modified and 
evaluated in the 



A. Caplinskas and J. Eder (Eds.): ADBIS 2001, LNCS 2151, pp. 182-195, 2001. 
(c) Springer- Verlag Berlin Heidelberg 2001 







pull-based broadcast environment we simulate. Finally, detailed simulation 
tests are conducted and reported. 

The remainder of the paper is organized as follows. In Sect. 2, we introduce 
the mobile computing environment that we assume in our work and summarize 
related work. In Sect. 3, we describe our approximate heuristic and its 
implementation, the Bucketing scheduling algorithm. Performance evaluation 
results of the proposed algorithm are provided and compared with the results of 
some well-known broadcast scheduling algorithms in Sect. 4. Finally, concluding 
remarks are provided in Sect. 5. 



2 Background 

2.1 Mobile Computing Environment 

In a common architectural model used for a mobile computing environment [8, 
9,12], the geographical area is divided into regions, called cells, each of 
which is covered and serviced by a stationary controller. There exist two types 
of computers: mobile units (MUs) and stationary computers (SCs). SCs are 
connected together via a fixed network. Some SCs are equipped with wireless 
interfaces to communicate with MUs and are called mobile support stations (MSSs). 
MSSs behave as entry points from MUs to the fixed network. MUs can consume 
and also produce information by querying and updating the online database 
stored on SCs. MSSs can be proxy servers on behalf of the other SCs or they 
can themselves be information servers. 

It is assumed in this mobile environment that there is a single broadcast 
channel dedicated to data broadcast. Users monitor this channel continuously 
to get the data items they require. There is a backchannel which enables MUs 
to send data requests to MSSs. 



2.2 Related Work 

The first work related to broadcasting in a pull-based environment is by 
Ammar and Wong in the context of teletext and videotext systems [6,7,19]. In [19], 
Wong proposes three alternative architectures for broadcast information 
delivery systems: one-way broadcast (push), two-way interaction (pull), and 
one-way broadcast/two-way interaction (hybrid). The heuristics used in two-way 
interaction are as follows [19]: 

— The well-known FCFS algorithm has been modified in such a way that if a 
page has been requested and placed in the service queue, a new request for 
that page is ignored. In this way, redundant broadcasts of the same page are 
avoided [5]. 

— Another heuristic proposed to be used in broadcast scheduling is Most Re- 
quested First (MRF). As the name of the heuristic implies, the page with 
the largest number of pending requests is selected to broadcast. 





— The MRF heuristic is configured to break ties in favor of the page with 
the lowest request probability if the request probabilities of the pages are 
available to the scheduling algorithm. This version of the heuristic is termed 
Most Requested First Lowest (MRFL). 

— The heuristic which selects the page with the largest total waiting time of 
all pending requests is Longest Wait First (LWF). 
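The selection rules listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `pending` mapping (page id to the arrival times of its outstanding requests) and all function names are hypothetical and do not come from [19]. The MRFL tie-breaking rule is omitted because it needs request probabilities.

```python
from typing import Dict, List

# Hypothetical representation: page id -> arrival times of its pending requests.
Pending = Dict[int, List[float]]


def select_fcfs(pending: Pending) -> int:
    """FCFS: the page whose first pending request arrived earliest."""
    return min(pending, key=lambda p: min(pending[p]))


def select_mrf(pending: Pending) -> int:
    """MRF: the page with the largest number of pending requests."""
    return max(pending, key=lambda p: len(pending[p]))


def select_lwf(pending: Pending, now: float) -> int:
    """LWF: the page with the largest total waiting time over all its requests."""
    return max(pending, key=lambda p: sum(now - a for a in pending[p]))


pending = {
    1: [0.0],                 # a single, old request
    2: [6.0, 7.0, 8.0, 9.0],  # many recent requests
    3: [2.0, 3.0],            # in between
}
now = 10.0
print(select_fcfs(pending), select_mrf(pending), select_lwf(pending, now))
```

In this small instance the three heuristics pick three different pages, which shows how LWF balances request age against request count. Note that LWF has to sum over every pending request of every page, which is the decision overhead discussed in the text.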

These heuristics are evaluated in [19] and it is concluded that when the system 
load is light, the mean response time is not sensitive to the heuristic used. This is 
due to the fact that in light loads, few scheduling decisions need to be made. On 
the other hand, when the system load is high and the page request probabilities 
follow Zipf’s Law [20], LWF has the best performance, whereas FCFS has the 
worst. 

Vaidya et al. have worked on data broadcast scheduling algorithms for push- 
based environments extensively and proposed several scheduling algorithms [10, 
11,13,14,17,18]. In [13], Jiang and Vaidya also investigate how the variance 
of response time can be minimized. The authors claim that their work and 
algorithms can be applied to pull-based environments as well. Therefore, we 
take their algorithms into consideration while devising our heuristic. 

The work most closely related to ours is that of Aksoy and Franklin [4,5]. 
The authors have proposed a scheduling algorithm which 
improves and unifies the FCFS and MRF heuristics. They conclude that the LWF 
heuristic has the best performance results according to overall mean waiting 
time. However, the authors also point out that the straightforward 
implementation of LWF is not practical. Aksoy and Franklin suggest integrating 
FCFS and MRF in a practical way to combine their advantages and eliminate their 
disadvantages. As a result, the authors propose the RxW heuristic, which 
balances the selection criterion between the number of pending requests and the 
first request arrival time of a data item. RxW computes the product of the total 
number of pending requests (R) and the waiting time of the first request (W) of 
each data item, and selects the data item with the maximum RxW value. 
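As a rough illustration of the RxW selection criterion, the following Python sketch scans all requested pages and returns the one with the largest product. The exhaustive scan and the `(R, first_arrival)` tuple layout are assumptions made for brevity; the sketch does not attempt to reproduce the search strategy of [4,5].

```python
from typing import Dict, Tuple

# Hypothetical layout: page id -> (R = pending requests, arrival time of first request).
PageState = Dict[int, Tuple[int, float]]


def rxw_score(num_pending: int, first_arrival: float, now: float) -> float:
    """R x W: pending request count times waiting time of the oldest request."""
    return num_pending * (now - first_arrival)


def select_rxw(pages: PageState, now: float) -> int:
    """Return the page with the maximum R x W value (exhaustive scan)."""
    return max(pages, key=lambda p: rxw_score(pages[p][0], pages[p][1], now))


pages = {1: (1, 0.0), 2: (4, 6.0), 3: (2, 1.0)}
print(select_rxw(pages, now=10.0))
```

With these values the scores are 10, 16, and 18, so neither the oldest page (FCFS choice) nor the most requested page (MRF choice) is selected.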

3 Bucketing Algorithm 

We have aimed to develop a scheduling algorithm that minimizes both the 
mean waiting time and its variance, while being robust to changes in the mobile 
environment and incurring lower overhead. We describe a new heuristic that we 
name Approximate Total Waiting Time (ATWT). The proposed ATWT heuristic is 
implemented using a bucketing scheme and the resulting algorithm is termed 
Bucketing Algorithm. 



3.1 Approximate Total Waiting Time 

In order to decrease the amount of computation, we first assume that all requests 
for a page come at the same time as the first one. Thus, we only keep the arrival 







time of the first request for each page. When we need to compute the total 
waiting time of a page, we can simply multiply the number of pending requests by 
the elapsed time since the first request arrived. This approximation gives us an 
upper bound on the total waiting time of a page. Provided that request arrivals 
are governed by a Poisson process, if a page is broadcast $\tau$ time units after 
the arrival of the first request for it, the mean waiting time of the pending 
requests for this page is $\tau/2$ [6,17]. This fact gives an approximation to 
compute the total waiting time for a page as follows: 



$$W_p(t) = \frac{t - A_p}{2} \cdot R_p(t) \qquad (1)$$

where $t$ is the current time, $A_p$ is the first request arrival time, and 
$R_p(t)$ is the total number of pending requests for page $p$ at time $t$. 
$W_p(t)$ is the approximate total waiting time for page $p$. The LWF heuristic 
needs to compute the total waiting time for every page to select the one with 
the largest value. We can drop the division by 2 in (1) to simplify the 
calculation, since the $W_p(t)$ values of the pages are only compared with each 
other. This finalizes the basic formulation, which we call Approximate 
Total Waiting Time (ATWT). ATWT enables us to record less information and 
do less computation 1. 
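To make the relation between the exact total waiting time, its upper bound, and the approximation in (1) concrete, here is a small Python sketch. The function names and the sample arrival times are illustrative assumptions, not part of the original algorithm description.

```python
from typing import List


def exact_total_wait(arrivals: List[float], now: float) -> float:
    """Total waiting time as LWF needs it: one term per pending request."""
    return sum(now - a for a in arrivals)


def upper_bound_wait(arrivals: List[float], now: float) -> float:
    """Treat every request as if it arrived together with the first one."""
    return len(arrivals) * (now - min(arrivals))


def approx_total_wait(arrivals: List[float], now: float) -> float:
    """Eq. (1): half of the upper bound, motivated by Poisson arrivals."""
    return upper_bound_wait(arrivals, now) / 2


def atwt(first_arrival: float, num_pending: int, now: float) -> float:
    """ATWT: Eq. (1) without the division by 2; needs only A_p and R_p(t)."""
    return (now - first_arrival) * num_pending


arrivals = [0.0, 2.0, 5.0, 9.0]
now = 10.0
print(exact_total_wait(arrivals, now))   # per-request sum
print(approx_total_wait(arrivals, now))  # Eq. (1)
print(atwt(min(arrivals), len(arrivals), now))
```

The point of the sketch is the storage and computation saving: `atwt` needs two numbers per page, whereas `exact_total_wait` needs the arrival time of every pending request.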



Finding the Maximal ATWT. The direct implementation of the heuristic 
we propose above has a time complexity of $O(N)$, where $N$ is the total number 
of requested pages. In order to avoid calculating the ATWT of every requested 
page, we use a method which selects a few pages and calculates only their 
ATWTs to select the page with the maximal ATWT value. Our implementation is 
based on a bucketing technique. We classify the pages according to the number 
of pending requests associated with them. All the pages that lie in bucket $i$ 
have pending request numbers ranging between $2^{i-1}$ and $2^{i}-1$. The number 
of buckets is limited by the number of pending requests for distinct pages. There 
will be $\lceil \log_2(R + 1) \rceil$ buckets of pages, where $R$ is the number 
of pending requests of the most requested page in the system. In each bucket, 
the pages are ordered according to their first request arrival time. The first 
page of each bucket is the first requested page within that bucket. 

Whenever we need to find the page with the maximal ATWT, we compare 
only the ATWT values of the first pages of each bucket. Since the number of 
buckets is logarithmic with respect to the most requested page’s request number, 
we would examine very few candidates. The page with the largest ATWT value is 
selected among the first entries of all buckets. It can be shown that the bucketing 
scheme results in selecting a page with an ATWT value which is at least half of 
the maximum ATWT value. 
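The half-of-maximum guarantee follows from two observations: within a bucket, the first page has the longest elapsed time, and any two pages in the same bucket differ in their request counts by less than a factor of two. The Python sketch below checks this numerically on random data. It is an illustration only: the names are hypothetical, and the bucket heads are recomputed with a full scan here, whereas a real implementation would maintain the buckets incrementally.

```python
import random
from typing import Dict, Tuple

Page = Tuple[float, int]  # (first request arrival time A, pending requests R)


def atwt(first_arrival: float, num_pending: int, now: float) -> float:
    return (now - first_arrival) * num_pending


def bucket_index(num_pending: int) -> int:
    """Bucket i holds pages with 2**(i-1) <= R <= 2**i - 1."""
    return num_pending.bit_length()


def select_by_buckets(pages: Dict[int, Page], now: float) -> int:
    """Compare only the earliest-requested page of each bucket."""
    heads: Dict[int, int] = {}
    for page, (first_arrival, num_pending) in pages.items():
        b = bucket_index(num_pending)
        if b not in heads or first_arrival < pages[heads[b]][0]:
            heads[b] = page
    return max(heads.values(), key=lambda p: atwt(*pages[p], now))


random.seed(0)
now = 100.0
pages = {p: (random.uniform(0.0, now), random.randint(1, 64)) for p in range(500)}
chosen = select_by_buckets(pages, now)
best = max(atwt(a, r, now) for a, r in pages.values())
print(len({bucket_index(r) for _, r in pages.values()}), "candidates examined")
print(2 * atwt(*pages[chosen], now) >= best)
```

The integer `bit_length` call is used instead of a floating-point logarithm so that the bucket boundaries at powers of two are exact.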



1 A formula similar to the approximation we provide for total waiting time has also 
been suggested by Aksoy and Franklin [5] but through completely different reasoning 
and observations from ATWT. 







3.2 Implementation of Bucketing Algorithm 

The data structure used for each requested page in Bucketing algorithm is 
illustrated in Fig. 1. Each bucket is a linked list of requested pages 2. Pages 
are ordered in the linked lists according to the first request arrival time. 
Fields Prev and Next are pointers to the previous and next pages, respectively, 
in the linked list. 





[Figure: a page record with four fields: Prev (previous page according to A 
value), A (first request arrival time), R (total number of pending requests), 
and Next (next page according to A value).] 

Fig. 1. Page data structure 



Entries for pages are placed in buckets by mapping the total number of requests 
to a bucket number. A page with a total request number $i$ is placed in bucket 
$\lfloor \log_2(i) \rfloor + 1$. 

Bucketing algorithm works as follows: when a request arrives at the server, 
if it is the first request for the page, its arrival time is recorded in field A 
of the page data structure and the number of pending requests (R) is set to one. 
The page is placed at the end of the linked list in the first bucket since its R 
value is 1. 

Otherwise, if the page was requested and not yet broadcast, R is incremented 
by one. Then, if the page no longer belongs to its current bucket, it is 
moved to the appropriate bucket according to its R value. The page is 
inserted in the linked list of this bucket with respect to its A value. 

In the selection of the page to broadcast, only the first page of each bucket 
is examined. The page with the largest ATWT is broadcast and removed from 
the bucket. The bucketing scheme reduces the decision overhead considerably 
without deteriorating the quality of the produced broadcast. 

We have also implemented a variant of Bucketing algorithm, called k-depth 
Bucketing algorithm, in which we examine the first k entries of each bucket. By 
comparing more entries in a bucket, the selected page is expected to be closer 
to the page with the true maximal ATWT. 
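The request-handling and selection steps described in this section, including the k-depth variant, can be sketched as follows. This is a simplified illustration and not the simulation code: sorted Python lists stand in for the linked lists of Fig. 1, and the class name, method names, and the `depth` parameter are hypothetical.

```python
import bisect
from typing import Dict, List, Optional, Tuple


class BucketingScheduler:
    """Illustrative bucketing scheduler; one sorted list of (A, page) per bucket."""

    def __init__(self) -> None:
        self.first_arrival: Dict[int, float] = {}  # field A of each page
        self.pending: Dict[int, int] = {}          # field R of each page
        self.buckets: Dict[int, List[Tuple[float, int]]] = {}

    @staticmethod
    def bucket_of(num_pending: int) -> int:
        return num_pending.bit_length()  # equals floor(log2(R)) + 1

    def request(self, page: int, now: float) -> None:
        if page not in self.pending:
            # First request: record A, set R = 1, place the page in bucket 1.
            self.first_arrival[page] = now
            self.pending[page] = 1
            bisect.insort(self.buckets.setdefault(1, []), (now, page))
            return
        old = self.bucket_of(self.pending[page])
        self.pending[page] += 1
        new = self.bucket_of(self.pending[page])
        if new != old:
            # Move the page and keep the target bucket ordered by A.
            entry = (self.first_arrival[page], page)
            self.buckets[old].remove(entry)
            bisect.insort(self.buckets.setdefault(new, []), entry)

    def select(self, now: float, depth: int = 1) -> Optional[int]:
        """Examine the first `depth` entries of each bucket; broadcast the best."""
        best, best_score = None, -1.0
        for entries in self.buckets.values():
            for first_arrival, page in entries[:depth]:
                score = (now - first_arrival) * self.pending[page]  # ATWT
                if score > best_score:
                    best, best_score = page, score
        if best is None:
            return None
        bucket = self.bucket_of(self.pending[best])
        self.buckets[bucket].remove((self.first_arrival[best], best))
        del self.pending[best], self.first_arrival[best]
        return best


s = BucketingScheduler()
for page, t in [(1, 0.0), (2, 1.0), (2, 2.0), (3, 3.0), (2, 4.0), (2, 5.0)]:
    s.request(page, t)
print(s.select(now=6.0))
```

With `depth=1` this corresponds to the basic algorithm, and larger values correspond to the k-depth variant. The `list.remove` calls are linear in the bucket length; a doubly linked list or a heap, as mentioned in the text, would avoid that cost.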

Minimizing the Variance of Waiting Time. We have investigated the variance 
of waiting time produced by ATWT and several other heuristics. In [13], 
Jiang and Vaidya reformulate the algorithm presented in [17] considering the 
variance metric and propose a new algorithm called the α-algorithm. The authors also 

2 For performance concerns, instead of using a linked list data structure, a heap data 
structure can also be implemented to store the items in each bucket. However, for 
the sake of simplicity we prefer to implement a linked list data structure in the 
simulation. 








claim that the algorithm can be adapted to pull-based systems as well. There- 
fore, we modify the computation of ATWT in our heuristic in a way similar to 
that suggested in [13] as follows: 

$$(t - A_p)^{\alpha} \cdot R_p(t) \qquad (2)$$

where $\alpha$ can be assigned different values in order to tune the variance of 
waiting time and the mean waiting time. 
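A short Python sketch shows how the exponent in (2) shifts the selection between an old page with few requests and a recently requested popular page. The function name and the two sample pages are assumptions chosen for illustration.

```python
def modified_atwt(first_arrival: float, num_pending: int, now: float,
                  alpha: float = 1.0) -> float:
    """Eq. (2): elapsed time raised to alpha, times the pending request count."""
    return (now - first_arrival) ** alpha * num_pending


now = 10.0
old_page = (0.0, 2)       # first requested long ago, few requests
popular_page = (8.0, 12)  # first requested recently, many requests

for alpha in (0.5, 1.0, 2.0):
    old_score = modified_atwt(*old_page, now, alpha)
    popular_score = modified_atwt(*popular_page, now, alpha)
    winner = "old" if old_score > popular_score else "popular"
    print(alpha, round(old_score, 2), round(popular_score, 2), winner)
```

For small exponents the request count dominates and the popular page wins; with an exponent of 2 the elapsed time dominates and the old page is selected instead.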



4 Simulation Results 

We have simulated the mobile environment introduced in Sect. 2.1. The 
simulation program was written in CSIM [16]. Due to lack of space, we cannot 
provide all the experiments and their results. For more details, please refer to [15]. 

4.1 Simulation Model 

Our simulation model consists of three main components: a mobile support 
station (MSS), a population of mobile units (MUs), and communication channels. 
Published requests are kept in a service queue. An online database stores the 
shared data items. The decision process is performed by a scheduling algorithm. 
The client population represents the MUs within the cell. The communication 
channel is a two-way medium. On the broadcast channel, selected data items are 
delivered to MUs, whereas the backchannel is used to send data requests of MUs 
to the MSS. 

Simulation parameters and their values are summarized in Table 1. dbSize is 
the total number of available data items at an MSS. Data items are numbered 
from 1 to dbSize, where a data item is, for example, a web page or a file. We use 
the terms data item and page interchangeably since the information server can 
be a database or web server. 

Table 1. Simulation parameters 

Symbol   Description                Default   Range            Unit 
λ        Mean Req. Arrival Rate     10        [10-100]         req./tick 
θ        Request Pattern Skewness   1.0       [0.1-1.0]        - 
dbSize   Database Size              1,000     [1,000-10,000]   pages 
pSize    Page Size                  1         -                tick 



Requests of MUs are represented by a single request stream. Request arrivals 
are assumed to be Poisson with a mean value of λ. By increasing λ, we can 
simulate a higher system load. MUs may exhibit data locality, querying a particular 
subset of the database repeatedly [9,12]. This subset is a hot spot for an MU. 
In general, a user may request multiple items simultaneously and would expect 







to receive mutually consistent versions of the requested items. In this paper, 
similar to much of the past work, we consider the case where a user demands 
only one item per request, and a new request is not initiated until the user 
gets the item. In our work, the effect of transmission errors is not considered. 
We assume that when a data item is broadcast, all the users requesting that item 
receive it completely. It is assumed that access probabilities follow the Zipf [20] 
distribution over the database items, as in much other related work (e.g., [3,5, 
19]). Data items are supposed to be ordered in the database according to their 
access probabilities in decreasing order, i.e., the most popular data item is in 
the first place in the database. Zipf's law states that the relative probability 
of a request for the $i$th most popular data item is proportional to $1/i$, where 
$i$ is between 1 and dbSize. The Zipf distribution can be formulated to show the 
demand probability of each data item as below: 



$$p_i = \frac{(1/i)^{\theta}}{\sum_{j=1}^{dbSize} (1/j)^{\theta}} \qquad (3)$$



where θ is a parameter termed the access skew coefficient [18]. By changing the 
value of θ, different Zipf distributions can be obtained. 
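The demand probabilities in (3) are straightforward to compute. The following Python sketch builds the distribution and draws a sample request stream from it; the function names are hypothetical, and the use of `random.Random.choices` for sampling is an assumption about how one might generate requests, not a description of the CSIM program.

```python
import random
from typing import List


def zipf_probabilities(db_size: int, theta: float) -> List[float]:
    """Eq. (3): p_i proportional to (1/i)**theta, normalized over the database."""
    weights = [(1.0 / i) ** theta for i in range(1, db_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def draw_requests(probs: List[float], count: int, rng: random.Random) -> List[int]:
    """Sample page numbers (1-based) according to the demand probabilities."""
    pages = range(1, len(probs) + 1)
    return rng.choices(pages, weights=probs, k=count)


probs = zipf_probabilities(db_size=1000, theta=1.0)
print(round(probs[0], 4), round(probs[999], 6))
print(draw_requests(probs, 10, random.Random(42)))
```

Lowering `theta` toward 0.1 flattens the distribution toward a uniform one, which matches the range of skewness values listed in Table 1.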

The time to broadcast a data item is measured in a specific time unit called a 
tick. We assume that page sizes (pSize) are equal and each page can be 
transmitted in one tick. The use of the tick as a time unit enables us to compare 
easily the results of systems with different properties such as bandwidth and 
data item size. For evaluating a broadcast scheduling algorithm for a particular 
set of parameters, the broadcast schedule is produced for at least 30,000 cycles. 
Furthermore, we run each configuration ten times and use the averages as final 
estimates. 



4.2 Performance Criteria 

We have used the following performance criteria in our evaluations: 

— Waiting time of a request is defined as the duration from when the request 
is made until the desired data item begins to be transmitted on the channel. 

— Variance of waiting time is taken into consideration to evaluate the Quality 
of Service experienced by any user [13], whereas the overall mean waiting time 
is an indication of the idle time for the whole user population. 

— Worst waiting time is defined as the maximum amount of time that any 
user request waits before being satisfied. The reason to use this criterion 
is to check if the algorithm causes starvation of some requests, which is an 
important property for interactive applications [5]. 

— Decision overhead is the time taken by the computation which should be 
done for selecting a data item to broadcast next. The decision overhead of a 
good scheduling algorithm should not be high. 
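The first three criteria above can be computed directly from per-request timestamps. The sketch below is a minimal illustration; the function name, the dictionary keys, and the choice of the population variance (rather than the sample variance) are assumptions. Decision overhead is not covered here because it depends on the scheduler's internal work rather than on request timestamps.

```python
from statistics import mean, pvariance
from typing import Dict, List


def waiting_time_summary(request_times: List[float],
                         broadcast_starts: List[float]) -> Dict[str, float]:
    """Waiting time runs from the request until its item starts being broadcast."""
    waits = [start - req for req, start in zip(request_times, broadcast_starts)]
    return {
        "mean": mean(waits),
        "variance": pvariance(waits),  # population variance (an assumption)
        "worst": max(waits),
    }


summary = waiting_time_summary([0.0, 1.0, 2.0], [2.0, 5.0, 8.0])
print(summary)
```

Here the three requests wait 2, 4, and 6 ticks, so the worst waiting time is well above the mean even in this tiny example.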

4.3 Mean Waiting Time 

In this experiment, the mean request arrival rate (λ) is varied from 10 to 
100 requests per tick. As depicted in Fig. 2, mean waiting times for all algorithms 







increase as the request arrival rate gets higher. However, after a 
certain rate, they level off. The results of MRF and FCFS are not presented 
in the figure because they have much larger mean waiting times than the other 
algorithms. LWF, RxW, and Bucketing algorithms are characterized by almost 
the same mean waiting time. The largest difference between any two of these 
three algorithms is not more than 0.8%. 




[Figure: plot with curves for LWF, RxW, and Bucketing Algorithm.] 

Fig. 2. Mean waiting times of the LWF, RxW, and Bucketing algorithms 



Although LWF is a good algorithm in terms of mean waiting time, straight- 
forward implementation of it is not practical for large databases and high-speed 
broadcast channels. RxW and Bucketing algorithm are the only algorithms which 
are practical to implement and satisfactory in terms of mean waiting time results. 

We have conducted several simulation tests to record the impact of 
different database sizes and access skewness values. It is observed that the 
waiting times of the algorithms remain almost the same relative to each other. 
As the database size increases, the mean waiting time increases as well. There 
is not much difference between the scalability of the algorithms with respect to 
database size. Therefore, considering the time required to perform the 
simulation experiments, we preferred to use a default database size of 1,000 
pages for all other experiments without loss of generality. As the skewness of 
the Zipf distribution is increased, the mean waiting time values of the 
algorithms, except FCFS, get considerably smaller. This result is due to the 
fact that a highly skewed request distribution (i.e., θ > 0.7) leads to the 
existence of many pending requests for a few data items, so that broadcasting 
one of the most requested data items satisfies many pending requests. RxW, LWF 
and Bucketing algorithms take the number of pending requests into account, and 
this property leads to more efficient use of the broadcast channel. However, 
when θ is 0.1, the distribution reduces to an almost uniform distribution, and 
each data item has almost the same number of pending 







requests. In that case, all the scheduling algorithms lead to almost the same mean 
waiting time. FCFS does not consider the pending request number in broadcast 
scheduling. Hence, the mean waiting time obtained with this algorithm does not 
improve much when the access skewness increases. 



4.4 Variance of Waiting Time 

In Sect. 3.2, we have modified our algorithm to handle the trade-off between 
mean waiting time and variance of waiting time. The ATWT value used as a 
selection criterion in Bucketing algorithm has been modified as in (2). 

To observe the effect of different α values on both the variance of waiting 
time and the mean waiting time of Bucketing algorithm, we conduct several 
experiments. In these experiments, the α parameter of (2) is varied from 0.5 to 
3.0, while using the other default parameter values given in Table 1. We observe 
that the trade-off between mean waiting time and variance of waiting time is 
evident. For higher values of α, the variance improves; on the other hand, the 
mean waiting time of the algorithm gets worse. The mean waiting time of the 
algorithm improves while α is increased from 0.5 to a certain value, which is 
0.9 in our experiment. Then, for values larger than this threshold, the mean 
waiting time begins to worsen again. This result is due to the fact that for α 
values higher than 1, the modified ATWT heuristic in (2) attaches more 
importance to the waiting time of the first request than to the total number of 
pending requests of a page. Therefore, the heuristic selects pages similar to 
those selected with FCFS. On the other hand, when α is set to values lower than 
1, the heuristic behaves in favor of the most requested pages, like MRF. As a 
result, as the α value gets smaller or larger than 1, the experienced mean 
waiting time becomes more similar to that of one of the two algorithms. 

We also conducted an experiment to compare the variance obtained with all 
scheduling algorithms. The results depicted in Fig. 3 show that FCFS has the 
lowest degree of variance due to the fact that in the worst case, the algorithm 
broadcasts any requested data item after broadcasting the whole set of data 
items in the database. That is, for the waiting time of a request, there is an 
upper bound which is determined by the database and page sizes. This upper 
bound also limits the variance of the waiting time. However, we cannot make 
this argument for the other algorithms. 

The performance results obtained after modifying the computation of ATWT 
in our algorithm as in (2) are presented in Fig. 3. In this experiment, we set 
the α value to 2. For high workloads, the modified ATWT has even better 
variance of waiting time than FCFS. However, as discussed above, the 
mean waiting time of the modified ATWT has become slightly worse (see Fig. 4). 
With a greater value for α (e.g., 3), the variance of waiting time can be further 
decreased; nonetheless, the mean waiting time would become worse. 








Fig. 3. Comparing Bucketing algorithm when α=2 



Fig. 4. Mean waiting time of Bucketing algorithm when α=2 



4.5 Worst Waiting Time 

The reason to investigate the worst waiting time is to check if a scheduling 
algorithm causes starvation of any request, which is an important property that 
should be avoided in interactive applications. Figure 5 displays the results for 
the longest waiting time experienced by any MU during the whole simulation 
time. For the default values of the simulation parameters presented in Table 1, 
FCFS has the lowest worst waiting time among all the algorithms. As discussed 
above, when FCFS is employed as the scheduling algorithm, any requested data 
item will be broadcast after the data items previously requested and the number 
of these data items is limited by the database size. In other words, the largest 







possible worst waiting time of a request is the time taken by broadcasting all 
the database items. However, for the other algorithms, it might be possible 
that a request waits while some of the data items are broadcast multiple times. 
Bucketing algorithm has considerably lower worst waiting time values compared 
to RxW. The results obtained with MRF are so much larger than those of the 
other algorithms that we do not include them in the figure. 




Fig. 5. Worst waiting time 



4.6 Scheduling Decision Overhead 

As discussed previously, a good scheduling algorithm should not have much 
scheduling decision overhead. In implementing the performance model, the 
decision overhead associated with each algorithm has not been considered. Since 
this overhead cannot be accurately simulated for different types of algorithms, 
the time spent during the decision process has been ignored. For a comparative 
evaluation of the decision overhead of the scheduling algorithms, we examined 
the number of requests scanned for selecting one of them to broadcast. If the 
number is large, the decision takes much more time and may become a bottleneck. 

We compared the average number of data items scanned by three algorithms: 
LWF, RxW and Bucketing. FCFS is not included in this experiment. It just 
broadcasts the request that has arrived first, and does not need to compare any 
entry. On the other hand, its overall waiting time is so bad that it is not a 
competitive algorithm to be used. 

As depicted in Fig. 6, LWF has the highest decision overhead while Bucketing 
algorithm has the lowest. The overhead of RxW is in between. Compared to 
the other algorithms, Bucketing algorithm examines significantly fewer 
requests at each scheduling decision. For a request rate of 10 requests per tick, 









Fig. 6. Decision overhead 



LWF compares 130 times more entries and RxW compares 36 times more entries 
than Bucketing algorithm does 3. 



4.7 Improving the Bucketing Algorithm 

We have tried to improve the mean waiting time of our Bucketing algorithm and 
implemented the k-depth approach as presented in Sect. 3.2. There is a trade-off 
between the decision overhead and the mean waiting time in this approach. As 
we increase the search depth, we need to compare more ATWT values, 
and we obtain a lower mean waiting time. When we set the depth parameter to 
50, the resulting mean waiting time of Bucketing algorithm is less than that of 
RxW, as can be seen in Fig. 7. 



5 Conclusion 

In this paper, the problem we address is the design of a broadcast scheduling 
algorithm which efficiently meets the demands of a mobile computing environment 
and mobile users. We have first proposed a new broadcast scheduling heuristic, 
ATWT, which is an approximate version of the LWF heuristic [19]. Then, we have 
developed an algorithm called Bucketing algorithm to implement the ATWT 
heuristic by using a bucketing scheme. Finally, we have conducted extensive simulation 

3 In [5], approximate versions of the RxW algorithm are proposed and they are shown 
to lead to fewer comparisons in deciding which data item to broadcast. However, these 
approximate versions have worse mean waiting time compared to RxW. The 
Bucketing algorithm, as discussed in Sect. 4.3, produces almost the same mean waiting 
times as RxW, while leading to less scheduling decision overhead. 








Fig. 7. Bucketing algorithm with depth=50 and the other competitive algorithms 



experiments to evaluate the performance of our algorithm, and to compare the 
performance results against those of previously proposed scheduling algorithms. 

Considering the performance results, the first remark to be made is that the 
algorithm most competitive with ours is RxW [5]. The other algorithms, 
except LWF, do not produce good results with respect to the main performance 
criterion, the overall mean waiting time. Although LWF has better performance 
than all the others, it has serious drawbacks which prevent its practical usage. In 
terms of the other performance metrics, i.e., variance of waiting time and worst 
waiting time, the performance of Bucketing algorithm is better, in general, than 
that of all other scheduling algorithms. 

References 

1. Acharya, S., Alonso, R., Franklin, M., Zdonik, S.: Broadcast disks: Data 
management for asymmetric communication environments. Proceedings of ACM SIGMOD 
(1995) 199-210 

2. Acharya, S., Franklin, M., Zdonik, S.: Balancing push and pull for data 
broadcast. Proceedings of ACM SIGMOD Conference. Tucson, Arizona (1997) 

3. Acharya, S., Franklin, M., Zdonik, S.: Dissemination-based data delivery using 
broadcast disks. IEEE Personal Communications. 2(6) (December 1995) 50-60 

4. Aksoy, D., Franklin, M.: Scheduling for large-scale on-demand data broadcasting. 
Proceedings of the IEEE INFOCOM Conference. (1998) 651-659 

5. Aksoy, D., Franklin, M.: RxW: A scheduling approach for large-scale on-demand 
data broadcast. IEEE/ACM Transactions on Networking. 7 (1999) 846-860 

6. Ammar, M.H., Wong, J.W.: The design of teletext broadcast cycles. Performance 
Evaluation. 5 (November 1985) 235-242 

7. Ammar, M.H., Wong, J.W.: On the optimality of cyclic transmission in teletext 
systems. IEEE Transactions on Communications. 35 (January 1987) 68-73 







8. Barbara, D.: Mobile computing and database: A survey. IEEE Transactions on 
Knowledge and Data Engineering. 11 (January-February 1999) 108-117 

9. Barbara, D., Imielinski, T.: Sleepers and workaholics: Caching strategies in mobile 
environment. ACM SIGMOD Record. 23 (May 1994) 

10. Hameed, S., Vaidya, N.H.: Log-time algorithms for scheduling single and multiple 
channel data broadcast. Proceedings of the Third Annual ACM/IEEE International 
Conference on Mobile Computing and Networking. Budapest (September 
1997) 90-99 

11. Hameed, S., Vaidya, N.H.: Efficient algorithms for scheduling data broadcast. 
Wireless Networks. 5 (1999) 183-193 

12. Imielinski, T., Badrinath, B.R.: Mobile wireless computing: Challenges in data 
management. Communications of the ACM (October 1994) 19-27 

13. Jiang, S., Vaidya, N.H.: Response time in data broadcast systems: Mean, variance 
and trade-off. Proceedings of International Workshop on Satellite-based 
Information Services (WOSBIS). (October 1998) 

14. Jiang, S., Vaidya, N.H.: Scheduling data broadcasting to "impatient" users. 
Proceedings of the ACM International Workshop on Data Engineering for Wireless 
and Mobile Access. Seattle, WA, USA (August 1999) 52-59 

15. Karakaya, M., Ulusoy, O.: An efficient broadcast scheduling algorithm for 
pull-based mobile environments. Submitted to the IEEE/ACM Transactions on 
Networking. 

16. Schwetman, H.: CSIM18 - the simulation engine. In Charnes, J., Morrice, D., 
Brunner, D., Swain, J., editors. Proceedings of the 1996 Winter Simulation Conference. 
San Diego, CA (1996) 517-521 

17. Vaidya, N.H., Hameed, S.: Scheduling data broadcast in asymmetric 
communication environments. Wireless Networks. 5 (1999) 171-182 

18. Vaidya, N.H., Jiang, S.: Data broadcast in asymmetric wireless environments. 
Proceedings of First International Workshop on Satellite-based Information 
Services (WOSBIS). NY (November 1996) 

19. Wong, J.W.: Broadcast delivery. Proceedings of the IEEE. 76 (December 1988) 
1566-1577 

20. Zipf, G.K.: Relative frequency as a determinant of phonetic change. XL. 
Reprinted from the Harvard Studies in Classical Philology. (1929) 




An Architecture for Workflows Interoperability 
Supporting Electronic Commerce* 



Vlad Ingar Wietrzyk 1 , Makoto Takizawa 2 , and Vijay Khandelwal 1 



1 School of Computing, University of Western Sydney 

v.wietrzyk@uws.edu.au, Australia 

2 Department of Computers and Systems Engineering 

Tokyo Denki University, Japan 



Abstract. Distributed object architectures (DCOM, CORBA, Java RMI) 
coupled with Internet and Intranet technology have a great impact on 
process-centered environments and distributed workflow management. 
Management of the workflow operations of each organisation in the in- 
terworkflow must be done in a distributed environment. 

An electronic commerce (EC) process is a business process and defining 
it as a workflow provides all the advantages that come with this technol- 
ogy. This paper describes the design of a model as well as an architecture 
to provide support for distributed advanced workflow transactions. The 
componentwise architecture of the system makes it possible to incorpo- 
rate the functionality and thus the complexity only when it is actually 
needed. We discuss the application of transaction concepts to activities 
that involve integrated execution of multiple tasks over different pro- 
cesses. This kind of application is described as transactional workflows. 



1 Introduction 

Distributed workflow execution across functional domains is necessary, but 
distribution transparency is currently impossible because different types of 
workflow management systems (WFMSs) implement different WFMS metamodels. 

For smooth cooperation among organizations, the most important point is 
how the hierarchy of the cooperation process among organizations is realized. 
We call the interoperation of workflows interworkflow, and the set of supporting 
technologies necessary for its realization the interworkflow management 
mechanism. Interworkflow is anticipated to be a supporting mechanism for 
Business-to-Business Electronic Commerce. 

One possible way to enable distributed workflow execution is to build a 
workflow-management infrastructure that integrates different, heterogeneous 
WFMSs. Users would have access to the total functionality because they access 
the underlying workflow-management infrastructure, not individual WFMSs. The 
resulting architecture is general and can accommodate as many WFMSs as required. 

* Research supported by the UWS, Versant Technology Corporation and Intel 
Corporation 

A. Caplinskas and J. Eder (Eds.): ADBIS 2001, LNCS 2151, pp. 196-209, 2001. 

© Springer-Verlag Berlin Heidelberg 2001 




Transaction concepts have begun to be applied to support applications or 
activities that involve multiple tasks of possibly different types (including, 
but not limited to, transactions) and that are executed over different types of 
entities (including DBMSs). The architect of such applications may specify 
inter-task dependencies to define task coordination requirements, as well as 
additional requirements for isolation and failure atomicity of the application. 
We will refer to such applications as multi-system transactional workflows. 

Workflow applications are long-duration applications since the duration of a 
workflow can range from a few hours to a few months. 

To summarize, the new aspects of our approach to distributed workflow 
database management systems supporting e-commerce include the following 
research contribution: a novel approach to the notions of correctness for 
transaction processing. The definitions of correctness usually proposed for 
multiuser databases are not necessarily suitable when these databases are parts 
of a multilevel secure workflow system. We believe that the best approach will 
depend upon the characteristics of the multilevel workflow database and the 
applications. 

An objective of our ongoing research is to design a highly scalable and secure 
architecture for efficient e-commerce operations, one that allows an arbitrary 
combination of transaction support systems and workflow management systems, 
irrespective of their locations. 



1.1 Outline of the Paper 

The paper is organized as follows. Section 2 gives a brief introduction to work 
on workflow transaction models and discusses extended and relaxed approaches to 
handling workflow transactions. Section 3 covers related aspects of workflow 
distribution and heterogeneity. Section 4 discusses a number of recently defined 
relaxed transaction models in workflow contexts, which permit a controlled 
relaxation of transaction isolation and atomicity to better match the 
requirements of various workflow applications. In Section 5 we consider two 
aspects of workflow interoperability: the interoperability protocol between 
independent WFMSs and the ability to model the interoperability in a workflow 
process definition tool. Section 6 concludes the paper with a summary and a 
short discussion of future research. 



2 Related Work 

Some known examples of extended transaction models include nested and 
multilevel transactions. A number of relaxed transaction models have been 
defined in the last several years that permit a controlled relaxation of 
transaction isolation and atomicity to better match the requirements of various 
advanced applications. Some examples of extended and relaxed transaction models 
are reported in [1,2]. 






In the WIDE project [3], a workflow is supported at two transaction levels: 
global and local. At the global level, the SAGA-based model offers relaxed 
atomicity through compensation and relaxed isolation by limiting the isolation 
to the SAGA steps. At this level of granularity the workflow activities are defined 
and therefore the grouped workflow activities follow the strict ACID properties. 
However, the flexibility in assigning transaction properties to workflow activities 
is limited to the extended SAGA or nested transaction model. 

Support for long-duration applications has been independently extended by 
practitioners and researchers focusing on workflow systems and transaction sys- 
tems. Extended transaction systems structure a large transaction into sub-transactions 
and execute them with additional constraints on the individual sub-transactions. 
Some researchers in workflow systems have proposed the notion of a transactional 
workflow [4]. In a transactional workflow environment, additional correctness 
requirements can be specified on top of traditional workflow specifications. 

The Workflow Management Coalition has specified a standard interface to 
facilitate the interoperability between different WFMSs [5]. However, this 
standard does not address transactional issues, with the exception of writing 
an audit log. 

The transaction model used in the Exotica project [6] is based on the SAGA 
model, but relies on statically computed compensation patterns. As a result, its 
functionality is limited compared to the work presented in this research paper. 

Finally, most commercial products are designed around a centralized database. 
This database and the workflow engine attached to it (in most cases there is 
a single workflow engine) are a single point of failure; they quickly become a 
bottleneck and are not capable of providing a sufficient degree of fault 
tolerance. 

To summarize, databases and workflow management systems are complementary 
tools within the corporate computing resources. Databases address the 
problem of storing and accessing data efficiently. Workflow management systems 
address the problem of monitoring and coordinating the flow of control among 
heterogeneous activities. 



3 Workflow Distribution and Heterogeneity 

Management of the workflow operations of each organization in the interwork- 
flow must be done in a distributed environment. Because the interworkflow de- 
fines the process of linking among different workflows, naturally the operations 
management takes place across different workflows. The interworkflow for inter- 
national procurement, for example, includes the procurement workflow and also 
the ordering/delivery management workflow of each supplier company; and these 
workflows are likely to be distributed across the procuring company’s system and 
the systems of each of the supplier companies. 

Workflow distribution introduces an additional level of requirements. Because 
distributed workflow execution across heterogeneous WFMSs is currently not 
possible in a transparent way, we must consider the problem of workflow 
functionality isolation. 




To define a work breakdown structure efficiently, a functional decomposition 
by means of subworkflows is required. This in turn requires version and variant 
management, because each alteration of a reused workflow definition might lead 
to a new variant or configuration of the reusing workflow definition if it needs 
to stay related to the old version. Dynamic changes of running workflows 
require a workflow's functional decomposition to change. This might also involve 
replacing elementary workflow tasks with other composite workflow definitions. 

Workflow distribution is called homogeneous if the associated WFMSs are of 
the same type, and heterogeneous otherwise. If the involved WFMSs are 
heterogeneous, this might necessitate data (object) type translation on the side 
of the requesting or receiving WFMS. A workflow is distributed when at least two 
of its objects reside in two different WFMS installations. This applies to 
workflow definitions as well as workflow instances. An often-cited situation is 
subworkflow distribution, where subworkflows are executed on remote WFMSs. 
Some variants are possible, such as executing a subworkflow synchronously or 
asynchronously to the invoking workflow. One typical variant involves 
executing some part of a workflow on one WFMS and continuing on another 
(see Figure 1). 

Fig. 1. Workflows Division across different WFMSs 

If the associated WFMSs do not know about each other, the distribution is 
indirect. In this case, the WFMSs do not implement distribution natively, and 
the system designer must attach distribution functionality to the associated 
WFMSs. A recognised way is to establish communication buffers between 
the WFMSs, such as a database or persistent file stores. Figure 2 shows an 
example workflow definition with one distribution task. The distribution task 
invokes an application for buffer communication. Typically, workflow types can 
be distributed, too. 
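
Indirect distribution through a communication buffer can be illustrated with a minimal sketch. It assumes that a SQLite table plays the role of the shared database buffer; the table layout and the function names (open_buffer, distribution_task, poll_buffer) are hypothetical and are not part of the architecture described here.

```python
import json
import sqlite3


def open_buffer(path=":memory:"):
    """Open (or create) the shared buffer; a SQLite table stands in for the
    database or persistent file store used between two WFMSs (assumption)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS buffer ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "target TEXT NOT NULL, payload TEXT NOT NULL, "
        "consumed INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def distribution_task(conn, target_wfms, workflow_data):
    """Run by the sending WFMS: hand the remaining work to another WFMS."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO buffer (target, payload) VALUES (?, ?)",
            (target_wfms, json.dumps(workflow_data)),
        )


def poll_buffer(conn, wfms_name):
    """Run by the receiving WFMS: fetch unconsumed entries and mark them."""
    with conn:
        rows = conn.execute(
            "SELECT id, payload FROM buffer "
            "WHERE target = ? AND consumed = 0 ORDER BY id",
            (wfms_name,),
        ).fetchall()
        conn.executemany(
            "UPDATE buffer SET consumed = 1 WHERE id = ?",
            [(row_id,) for row_id, _ in rows],
        )
    return [json.loads(payload) for _, payload in rows]
```

Neither WFMS needs to know the other in this sketch; both only agree on the buffer and on the name under which the receiver polls.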

3.1 An Architecture for Multilevel Secure Workflow Interoperability 

Global information management strategies based on a sound distributed 
architecture are the foundation for effective distribution of complex 
applications that are needed to support ever-changing operational conditions 
across security boundaries. What we need is a new Multilevel Secure (MLS) 
workflow distributed computing paradigm that can assist users at different 
locations and at different security levels to cooperate. 

A user can initiate a distributed workflow transaction at any site. If access to 
objects stored at remote sites is required, the distributed workflow transaction 
initiates a subtransaction at the remote site. To guarantee correct execution of 
distributed workflow transactions, each site in the distributed workflow database 
is under the operation of a concurrency control protocol and an atomic commit 
protocol. 

We present a fully distributed architecture for implementing a Workflow 
Management System (WFMS). An MLS workflow distributed database consists 
of a set $\mathcal{N}$ of sites, where each site $N \in \mathcal{N}$ is an MLS 
database. The sites in the workflow system are interconnected via communication 
links over which they can communicate. The WFMS architecture operates on top of 
a Common Object Request Broker Architecture (CORBA) implementation. CORBA's 
Interface Definition Language (IDL) is used to provide a means of specifying 
workflows. We also assume that communication links are secure, possibly using 
encryption. This distributed workflow transaction processing model describes 
mainly those components necessary for the distribution of a transaction over 
different domains. 




Fig. 2. The Distribution Task Invokes an Application for Buffer Communication 



Domain is a unit of autonomy that owns a collection of flow procedures and 
their instances. In practical terms, a domain might define the scope of a depart- 
ment or division in an organization. Therefore, flows are grouped by domains, 
and each domain also manages a set of flow procedures installed in the domain. 
A domain is not defined or limited by networks, processors, or peripherals. The 
resource managers can, however, be designed in any fashion; they are 
exclusively responsible for the ACID properties of their data records. Only the 
interface to the components of the distributed workflow model must exist. 







If a transaction is to be distributed over several domains (a global 
transaction), the following components must exist in every domain (see Figure 
3). 





[Figure 3 diagram. Recoverable labels: Tasks; Task Managers (TSM) connected 
through IDL interfaces; Administration & Monitoring Service; Workflow Database; 
Resource Managers (RMs); Semantic Transaction Manager (TM); TxRPC IDL; 
Communication Resource Managers (CRMs); OSI TP; communication with another 
(subordinate) node.] 



Fig. 3. Distributed Workflow Architecture 



— TM - Transaction-Manager. The transaction manager plays the role of the 
coordinator in the respective domain. If a transaction is initiated in this 
domain, the TM assigns it a globally unique identifier. The TM monitors 
all actions from applications and resource managers in its domain. In every 
domain involved in the distributed workflow transaction environment there 
exists exactly one TM. 

— CRM - Communication-Resource-Manager. Multiple applications in the same 
domain talk with each other via the CRM. This module is used by applica- 
tions but also other management components for inter-domain communica- 
tion. CRM is the most important module with respect to the transactional 




support for distributed workflow executions. Our model specifies TxRPC 
as a communication model, which supports a remote procedure call (RPC) 
in the transactional environment. 

— RM - Resource-Manager. An accountable performer of work. A resource can 
be a person, a computer process, or machine that plays a role in the workflow 
system. This module controls the access to one or more resources like files, 
printers or databases. The RM is responsible for the ACID properties on 
its data records. A resource has a name and various attributes defining its 
characteristics. Typical examples of these attributes are job code, skill set, 
organization unit, and availability. 

— AMS - Administration Monitoring Service. The monitoring manager is used 
to control the workflow execution. In our approach, there is no centralized 
scheduler. In the figure, each task manager, designated TSM, is equipped 
with a conditional fragment of code that determines if and when a given 
task is due to start execution. The scheduler communicates with task 
managers using CORBA’s asynchronous Interface Definition Language (IDL) 
interfaces. Task managers communicate with tasks using synchronous IDL 
interfaces. The AMS module is also responsible for coordinating the 
different sites in case of an abort that involves multiple sites. Individual 
task managers report their internal states, as well as data object references 
for possible recovery, to the monitoring manager. 

The distributed architecture naturally suits the inherently distributed 
character of workflows. 

This approach also eliminates the bottleneck of task managers having to 
communicate with a remote centralized scheduler during the execution of the 
workflow. The architecture also possesses high resiliency to failures: if any 
one node crashes, only a part of the workflow is affected. 
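
The idea of task managers that each carry their own start condition, with no centralized scheduler, can be sketched as follows. This is only an illustration under simplifying assumptions: the class TaskManager, the helper after, and direct method calls between peers are hypothetical stand-ins for the CORBA IDL interfaces of the architecture.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class TaskManager:
    """Hypothetical task manager (TSM): owns one task and its start condition."""
    name: str
    # Conditional fragment: decides from the known task states whether to start.
    start_condition: Callable[[Dict[str, str]], bool]
    state: str = "waiting"
    peers: List["TaskManager"] = field(default_factory=list)
    known_states: Dict[str, str] = field(default_factory=dict)

    def notify(self, task_name: str, task_state: str) -> None:
        """Receive a peer's state change and re-evaluate the start condition."""
        self.known_states[task_name] = task_state
        if self.state == "waiting" and self.start_condition(self.known_states):
            self.run()

    def run(self) -> None:
        """Execute the task (elided here) and tell the peers that it is done."""
        self.state = "done"
        for peer in self.peers:
            peer.notify(self.name, self.state)


def after(*names):
    """Start condition: all named tasks must have reported the state 'done'."""
    return lambda states: all(states.get(n) == "done" for n in names)
```

Each manager decides locally when its task may start, so the failure of one manager affects only the tasks that depend on it.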

4 Relaxed Transaction Models in Workflow Contexts 

A number of relaxed transaction models have been defined recently that 
permit a controlled relaxation of transaction isolation and atomicity to better 
match the requirements of various workflow applications. We refer to such 
applications as multi-system transactional workflows. This area has also been 
influenced by the concept of long-running activities. 

The intention is to merge advanced transaction technology and workflow 
management systems to support business processes with well-defined failure 
semantics and recovery features. Our work is based on an interpretation of the 
workflow operations from the database point of view. 

As has been pointed out in [7], WFMSs lack the ability to ensure the 
correctness and reliability of workflow execution in the presence of concurrency 
and failures. When the task is a transaction executed by a DBMS providing a full 
range of transaction management functions, we can take advantage of its 
concurrency control, commitment, recovery and access permission facilities. 
However, when the task is executed by an application system, we must understand 
the application system semantics that affect its transactional behavior. 



4.1 Transactional Workflows 

Support for workflow applications has been addressed by researchers focusing on 
workflow systems and transaction systems. Extended transaction systems struc- 
ture a large transaction into sub-transactions and execute them with additional 
precedence requirements between the start, commit, and abort events of the 
individual sub-transactions. Our approach falls in the category of transactional 
workflows [4], where additional correctness requirements can be specified on top 
of traditional workflow specifications. These requirements specify additional 
constraints on workflow execution schedules. Workflow management systems 
coordinate the execution of applications distributed over networks. The need for 
coordinated execution of workflow steps arises from application as well as data 
consistency requirements. Flexible transactions work in the context of 
heterogeneous distributed multidatabase workflow environments [8]. In such 
workflow environments, each database acts independently of the others. Because a 
local database can unilaterally abort a transaction, it is not possible to 
enforce the commit semantics of global transactions. Flexible transactions were 
designed to address this problem. 



4.2 The Functionality of Flexible Transactions in Workflow Systems 

A flexible transaction is specified by providing: the precondition of the global 
transaction, a set of subtransactions, the externally visible states of each sub- 
transaction and the possible transitions among these externally visible states, 
preconditions and postconditions for the possible transitions of each subtrans- 
action, and the postcondition of the global transaction. 

Traditional transactions are usually characterized by the atomicity, 
consistency, isolation and durability requirements, called the ACID properties 
of transactions. To better support the workflow operational environment, the 
flexible transaction model relaxes the isolation and atomicity properties. This 
approach is the direct result of our belief that tying a workflow system to a 
particular transaction model will result in major restrictions that limit its 
applicability and usefulness as a workflow tool. 



4.3 A Formal Model of Flexible Transactions 

From a user’s point of view, a transaction is a sequence of actions performed on 
data items in a database. The flexible transaction model proposed for the 
distributed workflow environment increases the failure resiliency of global 
transactions by allowing alternate subtransactions to be executed when a local 
database fails or a subtransaction aborts. The approach supports the concept of 
varied transactions, allowing compensatable and noncompensatable subtransactions 
to coexist within a single global transaction. This transactional environment 
allows a global transaction to have a weaker (relaxed) form of atomicity, termed 
semi-atomicity, while still maintaining its correct execution in the workflow. 
In a workflow multidatabase environment, a global transaction is a set of 
subtransactions, where each subtransaction is a transaction accessing the data 
items at a single local site. The concurrency control of global transactions 
requires that each global transaction has at most one subtransaction at each 
local site [9]. Following [8,10], the definition of flexible transactions takes 
the form of a high-level specification. The flexible transaction model supports 
flexible execution control flow by specifying two kinds of dependencies among 
the subtransactions of a global transaction: 

— Execution ordering dependencies between two subtransactions. 

— Alternative dependencies between two subsets of subtransactions. 

In what follows, we shall formally describe the flexible execution control in 
the flexible transaction model. 

Let $\mathcal{T} = \{t_1, t_2, \ldots, t_n\}$ be a collection of subtransactions 
and $\Pi(\mathcal{T})$ the collection of all subsets of $\mathcal{T}$. Let 
$t_i, t_j \in \mathcal{T}$ and $T_i, T_j \in \Pi(\mathcal{T})$. Two types of 
control flow relations are defined on the subsets of $\mathcal{T}$ and on 
$\Pi(\mathcal{T})$, namely: 

— precedence: $t_i \prec t_j$ if $t_i$ precedes $t_j$ ($i \neq j$); 

— preference: $T_i \triangleright T_j$ if $T_i$ is preferred to $T_j$ 
($i \neq j$). If $T_i \triangleright T_j$, we also say that $T_j$ is an 
alternative to $T_i$. 

Both of the above relations, precedence and preference, are irreflexive and 
transitive. More formally, for each $t_i \in \mathcal{T}$, 
$\neg(t_i \prec t_i)$, and for each $T_i \in \Pi(\mathcal{T})$, 
$\neg(T_i \triangleright T_i)$. If $t_i \prec t_j$ and $t_j \prec t_k$, then 
$t_i \prec t_k$; if $T_i \triangleright T_j$ and $T_j \triangleright T_k$, then 
$T_i \triangleright T_k$. 

From the above definitions, we can see that the precedence relation determines 
the correct parallel and sequential execution ordering dependencies among 
the subtransactions, while the preference relation determines the priority 
dependencies among alternate sets of subtransactions that can be selected to 
complete the execution of $\mathcal{T}$. 

Now a flexible transaction can be defined as follows: 

Definition 1. Flexible transaction. A flexible transaction $\mathcal{T}$ is a 
set of related subtransactions on which the precedence ($\prec$) and preference 
($\triangleright$) relations are defined. 

The semantics of the precedence relation refers to the execution order of 
subtransactions. For example, $t_1 \prec t_2$ may imply that $t_2$ cannot start 
before $t_1$ finishes or that $t_2$ cannot finish before $t_1$ finishes. By the 
same token, the preference relation defines alternative choices and their 
priority. For example, $\{t_i\} \triangleright \{t_j, t_k\}$ may imply that 
$t_j$ and $t_k$ must abort when $t_i$ commits or that $t_j$ and $t_k$ should not 
be executed if $t_i$ commits. In this case, $\{t_i\}$ has higher priority than 
$\{t_j, t_k\}$ to be chosen for execution. 
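
The two relations can be made concrete with a small sketch. It assumes that each relation is stored as a set of ordered pairs, with subtransactions as strings and subsets as frozensets; the helper names and the example transaction t1..t4 are hypothetical illustrations, not part of the model.

```python
from itertools import product


def is_irreflexive(relation):
    """True if no element is related to itself."""
    return all(a != b for a, b in relation)


def is_transitive(relation):
    """True if (a, b) and (b, c) in the relation imply (a, c)."""
    return all(
        (a, d) in relation
        for (a, b), (c, d) in product(relation, repeat=2)
        if b == c
    )


def transitive_closure(relation):
    """Smallest transitive relation containing the given pairs."""
    closure = set(relation)
    while True:
        new_pairs = {
            (a, d)
            for (a, b), (c, d) in product(closure, repeat=2)
            if b == c
        }
        if new_pairs <= closure:
            return closure
        closure |= new_pairs


# Hypothetical flexible transaction over subtransactions t1..t4.
# Precedence relates subtransactions; preference relates subsets.
precedence = transitive_closure({("t1", "t2"), ("t2", "t3")})
preference = {(frozenset({"t2"}), frozenset({"t4"}))}  # {t2} is preferred to {t4}


def alternatives_of(subset, preference):
    """Subsets that may be tried when `subset` cannot complete."""
    return [alt for preferred, alt in preference if preferred == subset]
```

A specification would be rejected if either relation fails the irreflexivity or transitivity check.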

We consider a workflow database state to be consistent if it preserves the 
workflow database integrity constraints. As is the case for traditional 
transactions, the execution of a flexible transaction as a single unit should 
map one consistent multidatabase workflow state to another. We designate the 
relation $(T_i, \prec^{*})$ as a partial order of subtransactions. 
$(T_i, \prec^{*})$ is a representative partial order if the execution of the 
subtransactions in $T_i$ represents the execution of the entire flexible 
transaction $\mathcal{T}$. It follows that, if $(T_i, \prec^{*})$ is a 
representative partial order, then there are no subsets $T_{i1}$ and $T_{i2}$ of 
$T_i$ such that $T_{i1} \triangleright T_{i2}$. Because each global transaction 
has at most one subtransaction at a local site, each representative partial 
order of a flexible transaction must have at most one subtransaction at a local 
site. In our workflow execution environment, the above definition of consistency 
requires that, for flexible transactions, the execution of the subtransactions 
in each representative partial order must map one consistent workflow 
multidatabase state to another. 



4.4 Scheduling of Flexible Transactions 

Since the flexible transaction model was proposed, much research has been de- 
voted to its application. The availability of visible prepare-to-commit states in 
local database systems is the basic assumption underlying this work. In such an 
operational environment, the preservation of the semi-atomicity of flexible trans- 
actions is relatively easy. As we mentioned in the previous subsection, failures 
of subtransactions in a flexible transaction are tolerated by taking advantage of 
the fact that a given function can frequently be accomplished by more than one 
database system. Also, time, used in conjunction with subtransactions and the 
global transaction, can be exploited in transaction scheduling. 

A schedulable subtransaction may be submitted for execution to the transac- 
tion module. The scheduler first has to check for satisfaction of the preconditions 
for execution of each subtransaction — it determines whether a subtransaction 
is schedulable. This entails the specification of the execution dependency among 
the subtransactions of a global transaction. An execution dependency [4] is a 
relationship among the subtransactions of a global transaction that determines 
the legal execution order of the subtransactions. To support the specification 
of the execution dependency, we define a transaction execution state as follows: 

Definition 2. The transaction execution state $x$ for a global transaction $T$ 
with $m$ subtransactions is an $m$-tuple $(x_1, x_2, \ldots, x_m)$ where: 

$$
x_i = \begin{cases}
E & \text{if } t_i \text{ is currently being executed;} \\
N & \text{if subtransaction } t_i \text{ has not been submitted for execution;} \\
S & \text{if } t_i \text{ has successfully completed;} \\
F & \text{if } t_i \text{ has failed or completed without achieving its objective.}
\end{cases}
$$

Under normal operational circumstances, the transaction execution state is used 
to keep track of the execution of the workflow subtransactions. It is also used 
to determine whether a global workflow transaction has achieved its objectives. 
When a subtransaction $t_i$ completes, the corresponding execution state $x_i$ 
is set to $S$ if the subtransaction has achieved its objective, and to $F$ 
otherwise. At a certain point of execution, the objectives of the global 
workflow transaction may be achieved. At that point, the global transaction is 
considered to be successfully completed and can be committed. 
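
The execution state of Definition 2 maps directly onto a tuple of state values. The sketch below is a minimal illustration; the precondition used in is_schedulable (all predecessors succeeded) and the notion of acceptable sets in is_successful are assumptions made for the example, and indices are 0-based rather than 1-based.

```python
from enum import Enum


class State(Enum):
    N = "not submitted"
    E = "executing"
    S = "succeeded"
    F = "failed"


def initial_state(m):
    """Execution state of a global transaction with m subtransactions."""
    return tuple(State.N for _ in range(m))


def set_state(x, i, value):
    """Return a new execution state with component i replaced."""
    return x[:i] + (value,) + x[i + 1:]


def is_schedulable(x, i, predecessors):
    """Assumed precondition: t_i is not yet submitted and every
    subtransaction that must precede it has succeeded."""
    return x[i] is State.N and all(x[j] is State.S for j in predecessors[i])


def is_successful(x, acceptable_sets):
    """Assumed objective: all members of at least one acceptable set of
    subtransactions have succeeded."""
    return any(all(x[i] is State.S for i in group) for group in acceptable_sets)
```

With two acceptable sets such as {0, 1} and {0, 2}, the failure of subtransaction 1 does not prevent the global transaction from committing once subtransaction 2 succeeds.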

A number of approaches can be used to assure global serializability, which 
constitutes a satisfactory correctness criterion for concurrent execution of 
multidatabase workflow transactions if there is a lack of additional information 
about their semantics. The objective of concurrency control is to assure that 
the serialization order of multidatabase workflow transactions is the same at 
all sites where they execute. It was shown in [8, 11] that this condition is 
sufficient to assure global serializability. However, in our workflow 
operational environment this requirement can be relaxed to require that the 
relative serialization order of workflow transactions be the same only at those 
nodes where they conflict. This leads to a weaker notion of serializability, 
called WT-serializability, which will be used as our correctness criterion for 
concurrent execution of workflow transactions. Two workflow transactions 
conflict if they execute at the same (local) site and are not commutative. The 
conflict relation is transitive and therefore determines a set of equivalence 
classes, which we call conflict classes. In our workflow environment they are 
used to determine the granularity of locking. To define WT-serializability, let 
us consider two workflow flexible transactions $WT_\alpha$ and $WT_\beta$ and 
conflict classes $i$ and $j$. A global schedule is WT-serializable if, for any 
subtransactions $ST_i^\alpha, ST_j^\alpha \in WT_\alpha$ and 
$ST_i^\beta, ST_j^\beta \in WT_\beta$ such that 
$\mathit{conflict}(ST_i^\alpha, ST_i^\beta)$ and 
$\mathit{conflict}(ST_j^\alpha, ST_j^\beta)$, 
$ST_i^\alpha \prec ST_i^\beta \Rightarrow ST_j^\alpha \prec ST_j^\beta$ at all 
sites where they conflict. In our workflow environment the $\prec$ relationship 
is defined in terms of local serializability. WT-serializability establishes a 
partial order among all workflow flexible transactions. The submission order at 
each system can be used to determine the execution order and, consequently, the 
serialization order at each site. Therefore, the concurrency control mechanism 
of the local system will assure that the transactions submitted to the local 
system are executed correctly with respect to the local concurrency control. As 
a result, the lock held by a subtransaction can be released as soon as the 
subtransaction completes its submission phase, so several transactions can 
execute concurrently at each local site. 
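
The pairwise condition behind WT-serializability can be checked mechanically. The following sketch assumes that the local serialization order at each site is given as a list of workflow transaction identifiers and that conflicts are given explicitly per site; the function name and the input encoding are hypothetical, and the check covers only the pairwise condition stated above.

```python
from itertools import combinations


def is_wt_serializable(site_orders, conflicts):
    """Check that conflicting workflow transactions keep one relative order.

    site_orders: {site: [transaction ids in local serialization order]},
                 each transaction appearing at most once per site.
    conflicts:   set of (site, frozenset({a, b})) entries marking the sites
                 at which transactions a and b conflict.
    """
    relative_order = {}  # frozenset({a, b}) -> (first, second) seen so far
    for site, order in site_orders.items():
        for first, second in combinations(order, 2):
            pair = frozenset({first, second})
            if (site, pair) not in conflicts:
                continue  # order at non-conflicting sites is unconstrained
            seen = relative_order.setdefault(pair, (first, second))
            if seen != (first, second):
                return False
    return True
```

Under global serializability, the reversed order of two transactions at a site where they do not conflict would already be a violation; here it is ignored, which is exactly the relaxation described in the text.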

5 Interworkflow Management Mechanism 

Composing an MLS workflow from multiple single-level workflows is the only 
practical way to construct a high-assurance MLS WFMS today. In this approach, 
the multilevel security of our MLS workflow does not depend on a single-level 
WFMS, but rather on the underlying MLS distributed architecture. On our 
platform, an MLS workflow becomes multiple single-level workflows that run on 
the MLS architecture. These independent workflows should work together to 
achieve the intended operation. Therefore, workflow interoperability is an 
important issue. 

We should consider two aspects of workflow interoperability: 

— The interoperability protocol between independent WFMSs. 

— The ability to model the interoperability in a workflow process definition 
tool. 

The first issue can be defined and handled by a standards body such as OMG. 
However, the second aspect should be handled by each WFMS. OMG [12] intro- 
duces two models of interoperability. They are nested sub-process and chained 
processes as shown in Figures 4 and 5. 






[Figure: Workflow A and Workflow B] 

Fig. 4. Nested sub-process 

[Figure: Workflow A and Workflow B] 

Fig. 5. Chained processes 



In nested sub-process workflow design, a task in workflow A may invoke 
workflow B in a role as the actor of a particular task and then wait for it to 
complete. Therefore, the task in workflow A is a requester, and the task that is 
realized by the sub-process can act as the coordination point for interaction of 
the two workflows. 

In chained workflow design, a task may invoke another, then carry on with 
its own business processing logic. In this case, the task associated with the 
sub-process would be another entity that is stimulated by the results of the 
sub-process. The workflows will terminate independently of each other. The 
nested sub-process and chained processes models provide powerful mechanisms 
for workflow interoperability. 
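
The difference between the two models lies in whether the invoking task waits. The sketch below illustrates this with threads standing in for two WFMSs; the function names and the use of a thread pool are assumptions made for the example and do not reflect the OMG interfaces.

```python
from concurrent.futures import ThreadPoolExecutor


def workflow_b(order_id):
    """Hypothetical sub-process run by another WFMS."""
    return f"order {order_id} confirmed"


def nested_task(executor, order_id):
    """Nested sub-process: the requesting task blocks until workflow B ends."""
    result = executor.submit(workflow_b, order_id).result()
    return f"workflow A continues with: {result}"


def chained_task(executor, order_id, on_result):
    """Chained processes: start workflow B, then carry on immediately.
    A separate entity (the callback) reacts to the result of workflow B."""
    future = executor.submit(workflow_b, order_id)
    future.add_done_callback(lambda f: on_result(f.result()))
    return "workflow A continues without waiting"
```

In the nested case the requesting task is the coordination point; in the chained case the two workflows terminate independently of each other.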

However, we would like to extend the above two models to support a richer 
interoperability model: cooperative processes. Very often there is a need to 
consider two independent autonomous workflows that need to cooperate across two 






different organizations. Let’s assume that organization A is in control of 
workflow A and organization B is in charge of workflow B. Then, tasks in workflow 
A and workflow B can communicate and synchronize with each other as shown 
in Figure 6. Generally, two or more cooperating workflows may have independent 
starting and ending points. 

Fig. 6. Cooperative processes 

Let’s state a minimal set of information that is required for communication 
and synchronization among tasks in cooperating autonomous workflows. These 
should include: 

— protocol to establish the location and invocation method of tasks 

— protocol to coordinate receiving of replies 

In order to ensure the proper generation of the runtime code, the above 
specification has to appear in the workflow design. Such an environment will 
guarantee that execution proceeds according to the workflow control protocol. 

6 Conclusion 

The impetus for our current research is the need to provide an adequate 
framework for interworkflow as it is anticipated that it will be a supporting mechanism 
for Business-to-Business Electronic Commerce. We proposed the following 
auxiliary functions as a necessary mechanism for implementing operations 
management of workflows in a distributed environment to support Electronic Commerce: 

— A mechanism for controlling execution of workflows designed on other work- 
flow management systems. 

— A mechanism for monitoring the status of workflows running on other 
workflow management systems. 

— The global-business characteristic of Electronic Commerce workflows re- 
quires a high-throughput workflow execution engine. Therefore, to ensure 
this kind of scalability, load distribution across multiple workflow servers is 
necessary. 

— The necessity of efficient replication of workflow servers. This will ensure 
that the market position of the merchant will not be weakened due to frequent 
failures and unavailability of Electronic Commerce servers. 




The notions of correctness for transaction processing that are usually 
proposed for multiuser databases are not necessarily suitable when these databases 
are parts of a multilevel secure workflow system. We believe that the best 
approach will depend upon the characteristics of the multilevel workflow database 
and the applications. It is incumbent upon those who develop multilevel database 
systems to ensure that the user’s needs and expectations are met to avoid mis- 
understandings about the system's functionality. 

The insight developed in the current research serves as the basis for the next 
step, which is to build application-specific software architectures that encode 
business logic for coordinating widely distributed manual and automated tasks 
in achieving enterprise level objectives. 

We chose to develop a framework for a distributed workflow architecture as 
the foundation of workflows interoperability supporting e-commerce. There are 
some aspects of workflows interoperability supporting e-commerce which need 
to be addressed to make the distributed workflow architecture more useful. In 
our future work we will focus on providing support for adaptive workflow so that 
a workflow can be altered at runtime. 



References 

1. V. Wietrzyk, Mehmet A. Orgun. A Foundation for High Performance Object 
Database Systems. In Databases for the Millennium 2000 in proceedings of the 9th 
International Conference on Management of Data, Hyderabad, December, 1998. 

2. A. Elmagarmid. Transaction Models for Advanced Database Applications. 
Morgan-Kaufmann, February 1992. 

3. G. Grefen, B. Pernici, G. Sanchez. Database Support for Workflow Management - 
The WIDE Project. Kluwer Academic Publishers, August 1999. 

4. M. Rusinkiewicz and A. Sheth. On Transactional Workflows. Bulletin of the 
Technical Committee on Data Engineering, June 1993. 

5. The Workflow Management Coalition Interoperability Abstract Specification. The 
Workflow Management Coalition, June 1996. 

6. G. Alonso and D. Agrawal. Advanced Transaction Models in Workflow Contexts. 
Proc. Int. Conf. on Data Engineering, 1996. 

7. D. Georgakopoulos, M. Hornick, and A. Sheth. An Overview of Workflow 
Management: From Process Modeling to Workflow Automation Infrastructure. Distributed 
and Parallel Databases, 3(2):119-153, April 1999. 

8. A. K. Elmagarmid, Y. Leu, W. Litwin, and M. E. Rusinkiewicz. A Multidatabase 
Model for Interbase. In Proc. of the 16th VLDB Conference, August 1990. 

9. V. Gligor and R. Popescu-Zeletin. Transaction Management in Distributed 
Heterogeneous Database Management Systems. Information Systems, 11(4), 1986. 

10. A. Zhang, M. Nodine, B. Bhargava, O. Bukhres. Scheduling with Compensation in 
Multidatabase Systems. CSD-TR-93-063, 11(4), October 1993. 

11. M. Ansari, M. Rusinkiewicz, L. Ness, A. Sheth. Executing Multidatabase Systems. 
TM-TSV-019450, July 1991. 

12. Object Management Group. The Object Request Broker. Architecture and Speci- 
fication, 2000. 



Object and Log Management in Temporal Log-Only 
Object Database Systems 



Kjetil Nørvåg* 

Department of Computer and Information Science 
Norwegian University of Science and Technology 
7491 Trondheim, Norway 
Kjetil.Norvag@idi.ntnu.no 



Abstract. We have previously studied the possible performance gain from using 
the log-only approach to realize temporal object database systems. Although the 
log-only approach in its basic form is relatively straightforward, it is not trivial to 
support features such as steal/no-force buffer management, fuzzy checkpointing, 
and fast commit. In this paper, we describe in detail algorithms and strategies for 
object and log management that make support for these features possible. 



1 Introduction 

Many emerging application areas for database systems demand features and performance 
that are difficult to support with today's systems. Common to many of these is the need for 
management of complex objects, support for temporal objects, and support for querying 
changes (for example, updates since a particular time T). Examples of such application 
areas are XML/Web databases, geographical information systems and PACS (Picture 
Archiving and Communications Systems for use in health care). 

An interesting way to support the desired features is a log-only object database 
system. The log is written contiguously to the disk, in a no-overwrite way, in large blocks. 
This is done by writing many objects and index entries, possibly from many transactions, 
in one write operation. This gives good write performance, but possibly at the expense of 
read performance. This contrasts with most current object database systems (ODBs), where 
data is updated in-place. In order to support recovery and increase performance, write- 
ahead logging can be used. This logging defers the in-place update, but sooner or later, 
the update has to be done. This often results in the writing of lots of small objects, creating 
a write bottleneck. This bottleneck can be avoided by using the log-only approach. 

The log-only approach has many features that make it interesting. Two examples are 
fast recovery and the ability to benefit more from using RAID technology than traditional 
systems. A third feature, which is the motivation for our research, is the fact that keeping 
previous versions of objects comes almost for free when using a log-only approach. This 
is particularly interesting for transaction-time object database systems (TODB), where 
object updates do not make previous versions inaccessible. On the contrary, previous 

* The author was supported in part by NFR (Research Council of Norway) and ERCIM (European 
Research Consortium for Informatics and Mathematics). 



A. Caplinskas and J. Eder (Eds.): ADBIS 2001, LNCS 2151, pp. 210-224, 2001. 
© Springer-Verlag Berlin Heidelberg 2001 







versions of objects can still be accessed and queried, and a system maintained timestamp 
(commit time of the transaction that created a version of an object) is associated with 
every object version. In a traditional system with in-place updating, keeping old versions 
of objects, which is required in a transaction-time temporal database system, usually 
means that the previous version has to be copied to a new place before update. This 
doubles the write cost. With the log-only approach, this is not necessary. 

Previous log-only object database systems have been page server based. While this 
works well in many contexts, it is not ideal. By operating on page granularity you get 
many of the disadvantages of traditional page servers. For example, if clustering is bad, 
and only a small part of a page has been updated, it is still necessary to write back the 
whole page. With bad clustering, main memory buffer utilization will be bad as well. 
A page based log-only ODB also makes transaction management difficult. To avoid 
page level locking, you essentially need to either 1) have a separate log anyway, or 2) use 
ad hoc techniques to solve the problem. Both solutions are likely to hurt performance and 
increase complexity. This, together with a performance analysis of log-only TODBs [3], 
has convinced us that an object based log-only TODB is the way to go. 

Although the log-only approach in its basic form, and using page granularity, is 
relatively straightforward, more effort is needed in order to achieve performance and high 
concurrency. In this paper, we describe the Vagabond approach for efficient storage and 
management of temporal objects. With the Vagabond approach, some of the problems in 
related designs are avoided, and steal/no-force buffer management, fuzzy checkpointing, 
and fast commit are supported. 

The organization of the rest of the paper is as follows. In Sect. 2 we give an overview 
of related work. In Sect. 3 we describe the Vagabond log-only approach. In Sect. 4 
we describe in detail the algorithms for the most important operations in the Vagabond 
approach. Finally, in Sect. 5, we conclude the paper. 



2 Related Work 



No-overwrite strategies have been used in shadow-paging recovery strategies earlier, 
e.g., in System R [1], but with the limited buffer size at that time, the performance was 
not satisfactory. POSTGRES [9] also employed a no-overwrite strategy, but it also had 
performance problems for several reasons, the most important being the buffer force 
strategy used. 

The Vagabond approach is based on the same philosophy as log-structured file 
systems (LFS), which were introduced by Rosenblum and Ousterhout [7]. LFS has been 
used as the basis for object stores, including the Texas persistent store [8]. Those object 
stores are page based, i.e., when an object has been modified, the whole page it resides 
on has to be written back. To our knowledge, there have been no publications on log-only 
ODBs operating on object granularity as in the Vagabond approach. 

A discussion of some topics related to TODBs can be found in [4]. More details 
about the Vagabond approach can be found in the author’s doctoral thesis [6]. 







3 The Vagabond Approach 

In this section we give an overview of our approach to log-only TODBs. We give an 
overview of log management, storage objects, and indexing. 



3.1 Introduction to Log-Only Log Management 

With the log-only approach, already written data is never modified; new versions of the 
objects are just appended to the log. Logically, the log is an infinite length resource, 
but the physical disk size is, of course, not infinite. This problem is solved by dividing 
the disk into large, equal sized, physical segments. When one segment is full, writing 
is continued in the next available segment. As data is vacuumed, deleted or migrated to 
tertiary storage, old segments can be reused. Dead data, in a TODB most often old index 
nodes, will leave behind partially filled segments. The data in these near empty segments 
can be collected and moved to a new segment. This process, which is called cleaning, 
makes the old segments available for reuse. By combining cleaning with reclustering, 
we can get well clustered segments. 

A segment can be in one of three states. A segment starts in a clean state, i.e., it 
contains no data. The segment currently being written to is called the current segment. 
When the segment is full, we start writing into a new segment. The new segment now 
goes from the clean state to current. The previous segment is now dirty, i.e., it contains 
valid data (note that dirty in this context has nothing to do with main-memory state versus 
disk state, as the term is most frequently used). Information about the status of the segments 
is kept in the segment status table (SST), which is kept in main memory during normal 
operation. 
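
The segment life cycle described above can be summarized in a minimal sketch.
The class and method names (SegmentStatusTable, advance, clean) are assumptions
made for illustration; only the three states and the transitions between them
are taken from the text.

```python
from enum import Enum


class SegmentState(Enum):
    CLEAN = "clean"      # contains no data
    CURRENT = "current"  # the segment currently being written to
    DIRTY = "dirty"      # full, contains valid data


class SegmentStatusTable:
    """Hypothetical in-memory SST holding the state of every physical segment."""

    def __init__(self, num_segments):
        self.states = [SegmentState.CLEAN] * num_segments
        self.current = None

    def advance(self):
        """Mark the current segment dirty and make the next clean one current."""
        if self.current is not None:
            self.states[self.current] = SegmentState.DIRTY
        for seg, state in enumerate(self.states):
            if state is SegmentState.CLEAN:
                self.states[seg] = SegmentState.CURRENT
                self.current = seg
                return seg
        self.current = None
        raise RuntimeError("no clean segment available; cleaning is required")

    def clean(self, seg):
        """Called once the live data of a dirty segment has been moved away."""
        if self.states[seg] is not SegmentState.DIRTY:
            raise ValueError("only dirty segments can be cleaned")
        self.states[seg] = SegmentState.CLEAN


sst = SegmentStatusTable(3)
first = sst.advance()   # segment 0 becomes current
second = sst.advance()  # segment 0 becomes dirty, segment 1 becomes current
sst.clean(first)        # segment 0 is available for reuse again
```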

At regular times, a checkpoint operation is performed. In the checkpoint operation, 
we write enough information to the log to make the current position in the log a consistent 
starting point for recovery. The checkpoint information is stored in checkpoint blocks, 
which are stored in fixed positions in the log. 

Each new version of an object is written to a new place, making the use of logical 
object identifiers (OIDs) necessary. When using logical OIDs, an OID index (OIDX) 
is needed to do the mapping from logical OID to physical location when retrieving 
an object. The index entries in the OIDX, the object descriptors (ODs), contain the 
physical address for an object and the commit timestamp. Note, however, that the ODs 
that are written together with the objects to the log contain transaction identifiers (TIDs), 
and the mapping from TID to commit time is written to the log as part of the commit 
operation. The ODs in the OIDX from committed transactions, however, contain commit 
timestamps. 

In a non-temporal ODB with in-place updating of objects, the OIDX needs only to 
be updated when objects are created, not when they are updated. In a log-only ODB, 
however, the OIDX needs to be updated on every object update. This might seem bad, 
and can indeed make it difficult to realize an efficient non-temporal ODB based on this 
technique. However, in the case of a temporal ODB, the OIDX needs to be updated on 
every object update also if using in-place updating, because either 1) the previous or 2) 
the new version must be written to a new place. Thus, when supporting temporal data 







management, the indexing cost is the same in these two approaches. We have in previous 
work developed several techniques that can be used to reduce the OIDX access cost. 



3.2 Objects 

In our approach, all objects smaller than a certain threshold are written as one contiguous 
object. Objects larger than this threshold are segmented into subobjects, and a subobject 
index (which also takes care of versioning of a large object) is maintained for each of 
these large objects (this is done transparently for the user/application). The value of 
the threshold can be set independently for different object classes. This is very useful, 
because different object classes can have different object retrieval characteristics. 
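
As an illustration of the threshold rule, the following minimal sketch splits an
object into subobjects only when it exceeds a per-class threshold. The function
name, the class names, the sizes, and the treatment of an object exactly at the
threshold are assumptions; the subobject index itself is not modeled.

```python
def split_object(data: bytes, threshold: int, subobject_size: int):
    """Return one chunk for a small object, several subobjects for a large one."""
    # Assumption: an object exactly at the threshold is still written contiguously.
    if len(data) <= threshold:
        return [data]
    return [data[i:i + subobject_size] for i in range(0, len(data), subobject_size)]


# Hypothetical per-class thresholds in bytes.
thresholds = {"Text": 8, "Image": 4}

small = split_object(b"abcdef", thresholds["Text"], subobject_size=4)
large = split_object(b"abcdefghij", thresholds["Image"], subobject_size=4)
```

With these values the six-byte object stays contiguous, while the ten-byte
object is written as three subobjects.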

Every object version in Vagabond has an associated object descriptor (OD), which 
contains the OID, physical location, commit timestamp (when the OD is not in the 
OIDX, the end timestamp of an object version is also included in the OD to reduce the 
cost of certain operations), and other administrative information. The entries in the leaf 
nodes of the subobject indexes are subobject descriptors (SODs). The SODs are also 
stored together with the subobject in the segments. The contents of an SOD include 
OID, physical location and write timestamp. While the timestamp in the OD denotes the 
commit timestamp of the actual object version, the write timestamp in the SOD is the 
time when the subobject was first written to a segment. This time is in general before the 
commit time of the actual transaction. When a subobject is moved because of cleaning, 
the write timestamp remains the same. The purpose of this, is to reduce the complexity 
of the cleaning. 

It is also possible to write delta objects instead of the whole objects, in order to 
reduce the amount of data that has to be written to the log (note that unlike traditional 
systems, that only use delta objects to reduce the log writing, a delta object in a log-only 
database system can be an object version on its own, i.e., the complete version will not 
necessarily be written). More details about delta objects can be found in [6]. 



3.3 OID Indexing 

In a traditional ODB, the OIDX is usually realized as a hash file or a B+-tree, with ODs 
as entries, and using the OID as the key. In a TODB, we have more than one version 
of some of the objects, and we need to be able to access current as well as old versions 
efficiently. Our approach to indexing is to have one index structure, containing all ODs, 
current as well as previous versions. For this purpose, either a traditional multiversion 
access method like the TSB-tree can be used, or the Vagabond temporal OID index 
which is optimized for OID indexing in TODBs [5]. 
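
The kind of lookup such an index has to support can be sketched as follows. This
is neither a TSB-tree nor the Vagabond temporal OID index; it is a simplified
in-memory stand-in (all names are hypothetical) that keeps every version of an
object and answers both current and "as of time t" lookups.

```python
import bisect
from collections import defaultdict


class TemporalOidIndex:
    """Simplified stand-in: one structure holding current and previous versions."""

    def __init__(self):
        self._timestamps = defaultdict(list)  # oid -> sorted commit timestamps
        self._locations = defaultdict(list)   # oid -> physical locations

    def insert(self, oid, commit_ts, location):
        # Assumption: versions of one object arrive in commit-time order.
        self._timestamps[oid].append(commit_ts)
        self._locations[oid].append(location)

    def lookup(self, oid, at_time=None):
        stamps = self._timestamps.get(oid, [])
        if not stamps:
            return None
        if at_time is None:
            return self._locations[oid][-1]  # current version
        # Latest version whose commit timestamp is <= at_time.
        pos = bisect.bisect_right(stamps, at_time)
        return self._locations[oid][pos - 1] if pos else None


idx = TemporalOidIndex()
idx.insert(7, 10, "seg1:0")
idx.insert(7, 20, "seg3:64")
current = idx.lookup(7)
as_of_15 = idx.lookup(7, at_time=15)
before_creation = idx.lookup(7, at_time=5)
```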

In a temporal ODB, the OIDX has to be updated every time an object is updated. 
This is a potential bottleneck. In order to reduce the index update and lookup costs 
in temporal ODBs, the persistent cache (PCache) approach can be used. The PCache 
contains a subset of the entries in the OIDX. The goal is to have the most frequently used 
ODs in the PCache. In contrast to the main-memory cache (the OD cache), the PCache 
is persistent, so that we do not have to write its entries back to the OIDX tree during 
each checkpoint interval. This is actually the main purpose of the PCache: to provide an 
intermediate storage area for persistent data, in this case, the ODs. The result should be 
a reduced update and lookup cost for ODs. 

Fig. 1. Overview of the TIDX, PCache, and index-related main-memory buffers 

The size of the PCache is in general larger than the size of the main memory, but 
smaller than the size of the OIDX tree. The contents of the PCache are maintained 
according to an LRU-like mechanism. The result should give higher locality on accesses 
to the PCache nodes, reducing the total number of installation reads. The average OIDX 
lookup cost will therefore also be less than without using a PCache. 

To avoid confusion, we will hereafter denote the OID index tree itself as the TIDX, 
and use OIDX to denote the combined index system, i.e., the PCache and the TIDX. 
Thus, when we say an entry is in the OIDX, it can be in the PCache, in the TIDX, or in 
both. This is illustrated in Fig. 1. A more detailed description and performance analysis 
of the PCache can be found in [2]. 
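
A lookup through the combined index can be sketched as follows. This is a
strongly simplified illustration, not the design analyzed in [2]: the real
PCache is persistent and node based, whereas the sketch uses an in-memory
ordered dictionary with strict LRU replacement, a plain dictionary as the TIDX,
and hypothetical names throughout. It also assumes that every entry is present
in the TIDX, so that evicting from the PCache never loses an OD.

```python
from collections import OrderedDict


class PCache:
    """Hypothetical LRU-managed subset of the OIDX entries."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, oid):
        if oid in self.entries:
            self.entries.move_to_end(oid)  # mark as most recently used
            return self.entries[oid]
        return None

    def put(self, oid, od):
        self.entries[oid] = od
        self.entries.move_to_end(oid)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used


def oidx_lookup(oid, pcache, tidx):
    """Look in the PCache first and fall back to the TIDX on a miss."""
    od = pcache.get(oid)
    if od is None:
        od = tidx.get(oid)
        if od is not None:
            pcache.put(oid, od)
    return od


tidx = {1: "od-1", 2: "od-2", 3: "od-3"}
pcache = PCache(capacity=2)
results = [oidx_lookup(oid, pcache, tidx) for oid in (1, 2, 1, 3)]
```

After these four lookups the PCache holds the ODs of objects 1 and 3, since
object 2 was the least recently used entry when the capacity was exceeded.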

4 Log-Only Database Operations 

In this section we present the algorithms for the most important operations in Vagabond. 
We will start with an overview and an introductory example of log writing, and continue 
with more detailed descriptions of the different operations in the rest of the section. 
The description in this section is based on using magnetic disk as secondary storage. 
However, except for the device which stores the checkpoint blocks (see Sect. 3.1), other 
storage technologies can also be used. 

Due to space constraints, the algorithms for vacuuming and cleaning are not described 
in this paper. For a detailed description of these, we refer to [6]. 

4.1 Introduction 

When a transaction is started, it is assigned a transaction identifier (TID). Unlike many 
other systems, it is not necessary to write any transaction start information to the log. In 








fact, if a transaction is aborted before it writes any objects to the log, there will be no 
trace left of the aborted transaction’s existence at all. 

An object that is written to the log, is always written together with its OD. Writing to 
the log is always done by writing large segments, which in general include objects from 
many transactions. The buffer system employs a steal strategy, so that modified objects 
can be written to the log before a transaction commits. This reduces commit time, as 
well as making it possible to handle large amounts of data in one transaction. 

When objects are written to the log before the transaction has committed, we do not 
know the commit timestamp. Therefore, ODs written to the log contain the TID instead 
of the timestamp. To be able to know if an OD contains a TID or a timestamp, TIDs 
always have the most significant bit set. The ODs written together with the objects in the 
log are only intended to be used if crash recovery is needed. In order to be able to know 
the timestamp of a committed transaction when doing recovery, a (TID, timestamp) 
tuple for the committing transaction is written to the log as a part of the commit. 
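
The tagging scheme can be illustrated with a minimal sketch. The 64-bit field
width and the function names are assumptions; the text only states that TIDs
have the most significant bit set.

```python
FIELD_BITS = 64                      # assumed width of the TID/timestamp field
TID_FLAG = 1 << (FIELD_BITS - 1)     # most significant bit marks a TID


def encode_tid(tid):
    if not 0 <= tid < TID_FLAG:
        raise ValueError("TID does not fit in the field")
    return tid | TID_FLAG


def encode_timestamp(timestamp):
    if not 0 <= timestamp < TID_FLAG:
        raise ValueError("timestamp does not fit in the field")
    return timestamp                 # most significant bit stays clear


def decode(field):
    """Tell whether an OD field holds a TID or a commit timestamp."""
    if field & TID_FLAG:
        return ("tid", field & ~TID_FLAG)
    return ("timestamp", field)


tid_field = encode_tid(42)
ts_field = encode_timestamp(1000)
```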

After a transaction commits, the objects that have been created or updated by the 
committing transaction become current versions. When another transaction later wants 
to read any of these objects, it first has to retrieve the OD of the object. This OD is either 
still in the OD cache, or has been installed into the OIDX. During normal operation, ODs 
are not discarded from main memory before they have been installed into the OIDX. 
ODs from a committed transaction are lazily installed into the OIDX after the commit 
has been finished. When the ODs are inserted into the OIDX, the timestamp is used in 
the OD, and not the TID. 

Most transactions have a short duration, and will be short enough to make it possible 
to write all its created ODs, and do the completion of the commit operation, during one 
checkpoint interval (between two consecutive checkpoints). In this way, all its ODs are 
guaranteed to be installed into the OIDX during the next checkpoint interval. This means 
that during recovery, we know that we only have to process the log back to the penultimate 
checkpoint. 

If a transaction lasts longer than one checkpoint interval, we allow ODs generated 
by this transaction to be inserted into the PCache, even if the transaction has not yet 
committed. These ODs are stored as uncommitted ODs in the PCache nodes, and cannot 
be used by any other transaction. In this way, when the transaction commits, all its ODs 
are guaranteed to be installed into the OIDX during the next checkpoint interval. Some of 
its ODs in the PCache will at this point still be marked uncommitted, but during crash 
recovery we know which committed transactions have dirty entries in the PCache, so that 
these can be handled properly. Entries from aborted transactions will be lazily removed 
from the PCache nodes as they are retrieved from disk. Objects in the log from aborted 
transactions will simply be discarded the next time the segments are cleaned. 

Crash recovery is simple in a log-only database system. If crash recovery is needed, 
the last part of the log is scanned. All ODs generated from a transaction that commits 
during one checkpoint interval, should be installed into the OIDX when the next check- 
point interval ends, so that an upper bound exists for the amount of log that has to be 
processed in the case of crash recovery. ODs from committed transactions that are not 
yet installed into the OIDX are collected, and can later be installed into the OIDX. ODs 
from aborted transactions and transactions that were ongoing at crash time, are ignored. 
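
The core of this scan can be sketched as follows. The record layout and the
function name are hypothetical, and the sketch ignores PCache entries of long
transactions; it only shows how ODs of committed transactions are collected and
how their TIDs are replaced by commit timestamps.

```python
def recover(log_tail):
    """Collect ODs of committed transactions from the scanned part of the log.

    Hypothetical record layout:
      ("od", tid, oid, location)   an OD written together with its object
      ("commit", tid, timestamp)   the (TID, timestamp) tuple written at commit
    """
    commit_times = {}
    pending = []
    for record in log_tail:
        if record[0] == "commit":
            _, tid, timestamp = record
            commit_times[tid] = timestamp
        elif record[0] == "od":
            _, tid, oid, location = record
            pending.append((tid, oid, location))
    # ODs of aborted or still ongoing transactions have no commit record
    # and are therefore ignored.
    return [(oid, commit_times[tid], location)
            for tid, oid, location in pending if tid in commit_times]


log_tail = [
    ("od", 1, "a", 100),
    ("od", 2, "b", 200),
    ("commit", 1, 50),
    ("od", 3, "c", 300),
]
to_install = recover(log_tail)
```

The returned ODs carry commit timestamps and can later be installed into the
OIDX; transactions 2 and 3 never committed, so their ODs are dropped.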







Example. We will now illustrate log writing with the use of Fig. 2, which shows a 
number of transactions. On the top, we have the time-line, running from left to right, 
with checkpoints marked. The inserts of ODs into the OIDX are illustrated with arrows 
in the figure. Please note that 1) the length of a checkpoint interval depends on the 
workload, 2) the commit operation in the physical log is composed of suboperations 
(i.e., prepare, commit, and commit completed), and 3) even though the transactions are 
illustrated with separate lines, objects and ODs from different transactions can be stored 
in the same segments. 

Starting with transaction Tg, this is an ordinary short transaction. The transaction 
commits during the second checkpoint interval, and the ODs generated from this trans- 
action are guaranteed to be installed into the OIDX when checkpoint 4 ends. In the case 
of a crash after checkpoint 4, no ODs from transaction Tg will exist in the part of the 
log that has to be processed during recovery. 

Transaction T7 spans more than one checkpoint interval, and is therefore treated as a 
long transaction. All the ODs written by transaction T7 during the first checkpoint 
interval, before checkpoint 2, must be installed into the PCache before the end of checkpoint 
3. This will be done in the background during the second checkpoint interval. After the 
transaction has committed, its ODs can be inserted into the TIDX as well. To emphasize: 
Before transaction T7 commits, its ODs can only be inserted into the PCache, but 
after the transaction has committed, its ODs can be inserted into the TIDX as well as the 
PCache. 




Similar to the case of transaction T7, some of the ODs from transaction 7’-, written 
during the first checkpoint interval might have been inserted into the PCache before it 
aborts. If this was the case, they will be removed from the PCache in a lazy way, as time 
goes by. ODs (and objects) written to the log will be removed later, during the segment 
cleaning process. 

Transaction Tg commits during the third checkpoint interval, and its ODs might be 
inserted into the OIDX during the same interval, but this is not guaranteed to be finished 
until checkpoint 5 finishes, i.e., the second checkpoint after its commit. 








We have now given a short introduction to the log generation, and we will now 
continue with a more detailed description of the operations. 

4.2 Object Operations 

A new OD is created every time an object is created, updated or deleted. In the case of 
an object create or update, the OD is written together with the object to the log; in the 
case of an object delete, it will be written to the log at a convenient time. In all cases, 
they will be written to the log before the transaction can finish the commit operation. 
The ODs will be inserted into the OIDX if the transaction commits. An OD will never 
be inserted into the TIDX itself before the actual transaction commits, but in the case 
of large transactions, some of the ODs can be inserted into the PCache (this will be 
described in more detail later in this section). Modified OIDX nodes will in general be 
written at a later time, so that the response time for a transaction commit can be short. 

The following description of the operations is mainly independent of which concur- 
rency control strategy is used. This means that we assume that in addition to performing 
the actions described below, concurrency control aspects are maintained. For example, 
if two-phase locking is used, we expect that the necessary lock(s) have been acquired 
before the actual operation is carried out. 



OD Management. Most transactions are short and update only a few objects. The ODs 
they have generated are usually not discarded from main memory until they have been 
installed into the OIDX. The ODs written together with the objects in the log will only be 
used if crash recovery is needed. All ODs resulting from a transaction committed during 
one checkpoint interval, should be installed into the OIDX before the next checkpoint 
interval ends. 

In the case of long 1 transactions, the number of ODs can be very high. If we require 
all ODs to be resident in the OD cache until they are inserted into the OIDX, the maximal 
size of a transaction would in this case be restricted by the size of the OD Cache. Another 
problem is transactions that last longer than a certain number of checkpoint intervals. 
All of the log created from the time when this transaction started to write to the log has 
to be processed during crash recovery. This can take a lot of time and it makes cleaning 
more complex, which in many situations is not acceptable. 

The problem with long transactions is solved by allowing ODs from uncommitted 
transactions to be inserted into the PCache (the reason for doing this, and other alterna- 
tives, are discussed in [6]). The extra cost is only marginal because PCache nodes are 
frequently read and written, and because most transactions commit, the PCache space 
wasted by this approach is low. 

We do not allow ODs from uncommitted transactions to migrate further from the 
PCache to the TIDX. The reason for this is that it would complicate commit processing 
and recovery, and it would also be costly. Also, it should not be necessary. In the case of 
very long transactions, the size of the PCache can be adaptively resized, so that its size 
does not limit the transaction size. 

1 When we here study long transactions, we mean both traditional long-living transactions, as 
well as “large transactions” (transactions that generate large amounts of data). 







Management of ODs from Uncommitted Transactions. In order to avoid other trans- 
actions accessing ODs from the uncommitted transactions, and because ODs from un- 
committed transactions contain TIDs instead of timestamps, we need to know which ODs 
in the PCache nodes are ODs from committed transactions, and which ODs are from 
uncommitted transactions. This is achieved by using two binary trees inside the PCache 
nodes. One tree, the committed tree, is used for the ODs of committed transactions, and 
another tree, the uncommitted tree, is used for ODs from uncommitted transactions. 

When a transaction commits, the ODs it generated, and which have been inserted 
into PCache nodes, should be moved from the uncommitted trees to the committed trees. 
This is done lazily. Every time a PCache node is retrieved, all ODs from committed 
transactions that are still in the uncommitted tree are moved to the committed tree. 
When an OD is moved, the TID in the OD is replaced with the commit timestamp of 
the transaction that generated the OD. 

We have to keep the TID of a committed transaction until all ODs stored in the 
PCache before it committed, have been moved to committed trees. For each committed 
transaction, we maintain a counter of how many ODs it still has in uncommitted trees 
in the PCache, and every time we move an OD from an uncommitted tree to a com- 
mitted tree, we decrement this counter. When the counter reaches zero, we can discard 
information on this transaction. 

The (TID, timestamp, counter) tuples are stored in a TID/timestamp/counter 
table (TTCT). The TTCT is written to the log during each checkpoint interval, in order 
to make it possible to reconstruct the table during recovery. Most transactions will be 
small, and ODs from these transactions will not be written to the PCache. Therefore, the 
size of this table will be relatively small. 
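
The lazy migration and the counter bookkeeping can be sketched as follows. The
sketch is an illustration under simplifying assumptions: plain dictionaries
stand in for the two binary trees inside a PCache node, the TTCT is a dictionary
from TID to a [timestamp, counter] pair, and all names are hypothetical.

```python
class PCacheNode:
    def __init__(self):
        self.committed = {}    # (oid, timestamp) -> location
        self.uncommitted = {}  # (oid, tid) -> location


def on_node_retrieved(node, ttct):
    """Move ODs of committed transactions to the committed tree.

    ttct maps tid -> [commit timestamp, number of ODs still in uncommitted trees].
    """
    for oid, tid in list(node.uncommitted):
        if tid not in ttct:
            continue  # transaction has not committed
        timestamp = ttct[tid][0]
        # Replace the TID with the commit timestamp while moving the OD.
        node.committed[(oid, timestamp)] = node.uncommitted.pop((oid, tid))
        ttct[tid][1] -= 1
        if ttct[tid][1] == 0:
            del ttct[tid]  # no ODs left in uncommitted trees; forget the TID


# Transaction 7 committed at time 55 and has two ODs in uncommitted trees.
ttct = {7: [55, 2]}
node_a = PCacheNode()
node_a.uncommitted = {("a", 7): 100, ("b", 9): 200}
node_b = PCacheNode()
node_b.uncommitted = {("c", 7): 300}

on_node_retrieved(node_a, ttct)
counter_after_first = ttct[7][1]
on_node_retrieved(node_b, ttct)
```

The OD of the uncommitted transaction 9 stays in the uncommitted tree of the
first node, and the entry for transaction 7 disappears from the table once its
last OD has been moved.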



Creating and Updating Objects. When we create an object, it is allocated a unique 
OID, and an OD is created. A new OD is also created every time we update an object. 
The OD of the new version will eventually make its way into the OIDX if the transaction 
commits, as described previously. There is little difference between temporal and 
non-temporal objects in the case of create and update operations; the difference is mostly 
whether the old OD is kept in the OIDX. 

To ensure durability, an object has to be written to disk before its transaction commits. 
This is similar to a traditional system, where we need to have the necessary information 
written to the log before the transaction commits. The difference, however, is that in 
Vagabond, the log is the final place for the objects. 

The buffer system employs a steal strategy, which means that a modified object can be 
written to disk before the transaction commits. The object and its OD are written together 
with other objects and their ODs, possibly from other transactions, into a segment. We do 
not know the timestamp of a transaction before it commits, so in the log we always use 
the TID instead of the timestamp in the OD. This is also the case for ODs of uncommitted 
transactions when they are inserted into the PCache. 

When large objects are updated, only the modified subobjects and the affected 
subobject-index nodes are written to the log. The use of the Vagabond large-object index 
ensures that only the affected parts of the subobject index need to be written. Large 
objects are possibly spread over several segments, and therefore the writing of large 
objects has to be done carefully. This is done by first writing the updated subobjects, and 
then the modified parts of the subobject index. Note that the timestamps in SODs (see 
Sect. 3.2) are the time of segment write, not the commit time. The reason is that 
this way we do not have to update the subobject after the transaction has committed. 
The timestamps in the SODs are only used during cleaning, and for that purpose, the 
exact commit timestamp is not needed. 



Deleting Objects. Temporal objects are not physically deleted. In this respect, they are 
mostly treated as non-deleted objects. For example, during cleaning, a deleted object will 
be moved to the new segment, similar to any non-deleted object. A non-temporal object, 
on the other hand, cannot be accessed after it has been deleted. It will be physically 
removed from the segment it resides in the next time the segment is cleaned. Whether 
an object is temporal or non-temporal is stored in the object’s OD. 

Deleting Temporal Objects. An object that is defined as temporal is deleted by 
writing a tombstone OD, which is an OD where the physical location is NULL, and the 
timestamp is the delete time. 

Note that whether the object is a large object or not, does not matter for the delete 
operation when the object is temporal. 

Deleting Non-Temporal Objects. If we do not want to keep the deleted version, i.e., it 
is not a temporal object, its OD is written to the log with both physical location and 
timestamp set to NULL. Unlike the tombstone OD, this OD is written to the log as 
logging information to be used in the case of recovery. When the OIDX is updated 
later, the OD for this object will be removed. When the object is deleted, the live-byte 
counter (in the SST, see Sect. 3.1) for the segment where the object resides is decremented 
accordingly. 

If the object is a large object, the subobject index has to be traversed in order to 
decrease the live-byte count for the segments where the subobject-index nodes and the 
subobjects are stored. The subobject-index nodes and the subobjects will be deleted next 
time the respective segments are cleaned. 

Deleting New Objects. If an object is deleted by the same transaction that created it, the 
effect on the database should be the same as if the object had never been created. This 
is assured by using the following algorithm: 

1. If the object has not yet been written to the log, the only action needed is to remove 
the OD from the OD cache and delete the object from the main-memory buffer. 

2. If the object has been written to the log, but its OD is still in the OD cache and is 
dirty with respect to the OIDX, the only action needed is to write a tombstone OD 
to the log. If a crash occurs, the recovery algorithm will know that an object deleted 
by the same transaction that created it, should be discarded. 

3. If the object has been written to the log, and the OD in the OD cache is clean or the 
OD is not resident in the OD cache, that means that the OD has been inserted into a 
PCache node. In this case, the OD has to be removed from the PCache node. In order 
to avoid a synchronous operation, a tombstone OD is created and inserted into the 
OD cache. The OD in the PCache node is removed the next time the PCache node 
is brought into main memory. The tombstone OD that is inserted into the OD cache 
has to be written to the log before or during commit, if the PCache node has not 
been updated before that time. 
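
The three cases can be summarized as a small decision function. The following Python sketch only classifies which action applies; the flag names and the `Action` enumeration are hypothetical and introduced for illustration.

```python
from enum import Enum


class Action(Enum):
    DROP_FROM_MEMORY = 1   # case 1: remove OD from the OD cache and object from the buffer
    LOG_TOMBSTONE = 2      # case 2: write a tombstone OD to the log
    CACHE_TOMBSTONE = 3    # case 3: put a tombstone OD into the OD cache, fix the PCache lazily


def delete_new_object(written_to_log, od_in_cache, od_dirty):
    """Pick the action for an object deleted by the transaction that created it."""
    if not written_to_log:
        return Action.DROP_FROM_MEMORY
    if od_in_cache and od_dirty:
        return Action.LOG_TOMBSTONE
    # OD is clean or no longer resident, so it has reached a PCache node.
    return Action.CACHE_TOMBSTONE
```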



Reading Objects Stored as Complete Versions. We now describe how to retrieve 
current as well as historical object versions. The physical location of an object version is 
stored in its OD, and in order to read an object that is not resident in memory, we have to 
first retrieve its OD. If the object is a large object, the location in the OD is the location 
of the root of the subobject index of the actual object version, and this subobject index has to 
be traversed in order to retrieve the requested subobject(s). 



Current Versions. When we want to read the current version of an object, we first check 
if the OD is resident in the OD buffer. If not, we have to do a lookup in the OIDX. 
An OIDX lookup is done by first searching the PCache, and if the OD is not found in 
the PCache, the TIDX is searched. When doing a lookup in a PCache node, it is only 
necessary to search the uncommitted tree in the PCache node if it is possible that the 
object has been previously modified by the same transaction that is now requesting the 
object. If a locking protocol is used, this can only be the case if the transaction already 
owns a write lock on this object (or in the case of hierarchical locking, a lock for a larger 
granularity, for example a container). 



Historical Versions. If we know the timestamp of the historical version we want to 
retrieve, the lookup for the OD and retrieval of the object can be done in the same way 
as when reading the current version of an object. However, quite often the query is for 
an object version valid at a certain time $t_i$. In this case, we have to retrieve the OD 
with the largest timestamp less than or equal to $t_i$. It is this operation that makes an 
end timestamp in the OD beneficial when the OD is outside the TIDX (see Sect. 3.2). 
If we did not have the end timestamp in the OD, it would not be sufficient to access the 
OD cache or the PCache to find the OD. Even if we found an OD in the OD cache or 
PCache with a timestamp $t_j$ that is close to $t_i$, there could have been updates between 
$t_j$ and $t_i$. This would be impossible to know from the ODs alone, and we would have to 
do a lookup in the TIDX for every such retrieval. 
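
The role of the end timestamp can be illustrated with a short Python sketch. It is an illustration only: the `OD` tuple layout, the use of `None` as the end timestamp of the current version, and the half-open validity interval are assumptions made for the example, and a sorted list stands in for the index.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional


class OD(NamedTuple):
    start: int                 # timestamp of this version
    end: Optional[int]         # assumed: timestamp of the next version, None if current
    location: Optional[int]    # physical location in the log


def covers(od, t):
    """With an end timestamp, a single cached OD tells whether it is valid at t."""
    return od.start <= t and (od.end is None or t < od.end)


def version_at(ods, t):
    """Index-style lookup: the OD with the largest timestamp <= t.

    `ods` must be sorted by start timestamp.
    """
    starts = [od.start for od in ods]
    i = bisect_right(starts, t)
    return ods[i - 1] if i > 0 else None
```

Without the `end` field, `covers` could not be evaluated from one OD alone, and every query of this kind would fall through to the index lookup.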



4.3 Transaction Management 

In order to be able to do recovery after a failure, it is necessary to ensure that enough 
information has been written to the log before a transaction commits. We have previously 
described how objects can be written to disk before a transaction commits, in order to 
avoid writing all the objects modified by the transaction in one burst during commit, 
and how we write the ODs to the log in order to avoid synchronous updates of OIDX 
nodes at commit time. We will now describe in more detail transaction management in 
Vagabond. 







Commit. The transaction commit operation can in principle be implemented by writing 
the objects from the transaction that are still dirty in the object buffer to the log, followed 
by a transaction finished mark which includes a (TID, timestamp) tuple. After the 
objects and the transaction finished mark have been written to the log, the transaction 
commit is considered finished. Objects, ODs, and the transaction finished mark can be 
stored in the same segment, and more than one transaction can be committed in one 
segment write (similar to traditional group commit). In this way, the response time can 
be as low as the time it takes to write one segment, and this technique should in most cases 
give good throughput in a single-server system. However, it is difficult to implement an 
efficient 2-phase commit operation by using this simple technique. 2-phase commit is 
crucial in a multi-server system, and in the Vagabond approach we use a more elaborate 
commit protocol, where more information is written to the log in the various phases of 
the commit process [6]. 



Abort. In a log-only database, we do not need to undo operations when we abort a 
transaction. If the transaction that wrote an object does not commit, an object written to 
the log before the abort operation will simply be a dead object, which will be removed 
the next time the segment is cleaned. 

No ODs reflecting updates from a transaction will be installed into the TIDX until 
after the commit has been done. ODs with OIDs that have been allocated by a transaction 
that has aborted will never be inserted into the TIDX, and the OIDs are not reused later 
by any other transaction. ODs from aborted transactions that have been inserted into the 
PCache will be removed lazily at the same time as ODs from committed transactions 
are moved to the committed tree (see Sect. 1). 

When a transaction aborts, the live-byte counts for the segments where the objects 
were written are decremented accordingly. This can only be done immediately for the 
objects where ODs are still in main memory. For those objects where the ODs have 
been removed from the OD cache and inserted into the PCache, the live-byte counts are 
decremented when the ODs are removed from the PCache nodes. 



4.4 Recovery 

When the system is restarted, it is determined from the checkpoint block whether the 
shutdown of the system was controlled or caused by a crash. If it was caused by a crash, 
recovery is needed. We now describe how to reduce the recovery time by checkpointing, 
and take a closer look at recovery and how to handle media failures. 



Checkpointing. The main purpose of checkpointing is to reduce the recovery time. This 
is achieved by bounding the amount of log that has to be processed at recovery time. In 
a traditional database system, the main part of the checkpoint process is to write dirty 
pages back to disk, usually by the use of a fuzzy checkpointing technique. In a log-only 
system, this is not the important issue. The log is the final repository, and objects will 
have to be written to the log before commit in any case. In Vagabond, the main issue 
of checkpointing is to install the ODs into the OIDX. This is done to avoid having to 
read excessive amounts of log at recovery time (in order to find ODs that have not been 
installed into the OIDX before the system crashed). 

During a checkpoint interval, between the checkpoints, ODs from committed trans- 
actions are installed into the OIDX. To keep the installation rate high enough and reduce 
the amount of memory needed to store ODs not yet installed into the OIDX, all ODs 
from a transaction committed during one checkpoint interval, should be installed into 
the OIDX no later than the end of the next checkpoint interval. 

The segment status table (SST), PCache status table (PCST), and the 
TID/timestamp/counter table (TTCT) are resident in main memory, and in order to 
be able to recreate these after a crash, the contents of these tables are written regularly 
to the log. This is done by writing a certain range of entries from these tables each time 
we write a segment. During each checkpoint interval, all SST, PCST, and TTCT entries 
should have been written at least once. 

Checkpointing can be costly, so it is important that the amount of data to be written 
at checkpoint time is as low as possible, and that data structures locked as a part of the 
checkpointing process are locked only for a short time. In Vagabond, most operations 
can run as normal during checkpointing. The only restriction is that the timeout values 
for 2-phase commit should be smaller than one checkpoint interval. The checkpoint 
algorithm is as follows: 

1. Wait until the number of written objects since last checkpoint, or the number of 
segments written since last checkpoint, reaches a certain threshold. 

2. If there are ODs created before the last checkpoint that are not yet installed into the 
OIDX, stop all other log processing until this has been done. Note that this delay is 
undesirable, and can normally be avoided by giving high enough priority to OIDX 
updating. To reduce a possibly long checkpointing time when this situation occurs, 
it is possible to solve the problem temporarily (or rather postpone the problem) 
by simply writing to the log the dirty ODs that have not been written since last 
checkpoint. 

3. If there are entries in the SST, PCST, or TTCT that have not been written during 
this checkpoint interval, write them to the log now. 

4. Update the least recently written checkpoint block. Now the checkpointing is fin- 
ished, and normal operation continues. 
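
The interplay between piggybacking table entries on segment writes and step 3 of the checkpoint algorithm can be sketched as follows. This is a hedged Python illustration: the class `TableFlusher`, the fixed chunk size, and the round-robin order are assumptions, since the text only states that a certain range of entries is written with each segment.

```python
import math


class TableFlusher:
    """Tracks which entries of a resident table were logged in this interval."""

    def __init__(self, n_entries, segments_per_interval):
        if n_entries <= 0 or segments_per_interval <= 0:
            raise ValueError("both arguments must be positive")
        self.n = n_entries
        # Chunk size chosen so that the expected number of segment writes
        # per checkpoint interval covers the whole table once.
        self.chunk = math.ceil(n_entries / segments_per_interval)
        self.next = 0
        self.written = set()

    def on_segment_write(self):
        """Return the range of entries to piggyback on this segment."""
        lo = self.next
        hi = min(lo + self.chunk, self.n)
        self.written.update(range(lo, hi))
        self.next = hi % self.n
        return range(lo, hi)

    def on_checkpoint(self):
        """Step 3: entries not yet written in this interval must be written now."""
        missing = sorted(set(range(self.n)) - self.written)
        self.written.clear()
        return missing
```

If fewer segments than expected are written during an interval, `on_checkpoint` returns the leftover entries, which corresponds to the extra writing at checkpoint time that the algorithm tries to keep small.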



Crash Recovery. The purpose of crash recovery is to reconstruct a consistent state. In 
a traditional system, this is a very complex operation, and typically involves an analysis 
phase, a redo phase, and an undo phase. In a no-overwrite system like Vagabond, undo or 
redo of objects is not necessary. However, ODs and transaction management information 
are written to the log, and this information has to be read in order to rebuild the resident 
structures. 

The first step in a recovery is to identify the last segment that was successfully 
written before the crash. This is done by reading the log from the last checkpoint until 
1) we come to a segment that was only partially written (the system crashed when it was 
writing this segment), or 2) the next segment of a segment does not exist (the address of 
the next segment is determined before a segment is written so that it can be stored in the 
currently written segment; in this way, if the next segment does not exist we know that 
the system crashed in the interval between writing two segments). 

When we read the log in order to find the end of it, we also collect all ODs, and keep 
each OD for which we later find a commit operation for the transaction that generated 
it. ODs for which we do not find a commit can be safely discarded, because 
the system crashed before the transaction was committed. 
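
The forward pass described above can be sketched as follows. This is a minimal Python illustration rather than the actual recovery code; the record layout (tuples tagged `"od"` and `"commit"`) and the per-segment `complete` flag are assumptions made for the example.

```python
def scan_forward(segments):
    """Read segments written after the last checkpoint, in log order.

    Returns (oid, commit_timestamp, location) for every OD whose
    transaction is found to have committed before the end of the log.
    """
    pending = {}     # TID -> ODs seen so far without a commit
    recovered = []
    for seg in segments:
        if not seg["complete"]:
            break    # partially written segment: the log ends before it
        for rec in seg["records"]:
            if rec[0] == "od":
                _, tid, oid, location = rec
                pending.setdefault(tid, []).append((oid, location))
            elif rec[0] == "commit":
                _, tid, timestamp = rec
                for oid, location in pending.pop(tid, []):
                    recovered.append((oid, timestamp, location))
    # Whatever is left in `pending` belongs to transactions that never
    # committed and is simply discarded.
    return recovered
```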

After we have identified the end of the log, and processed the part of the log written 
after the last checkpoint, we read the log from the last checkpoint and backwards until the 
penultimate checkpoint. In this way, we process all segments that might have ODs from 
committed transactions that have not yet been installed into the OIDX. This backward 
reading can be done efficiently, because all segments have a pointer to the previous 
segment. In addition, all segments also contain the number of ODs in the previous 
segment in the log. This makes it possible to only read the part of the segment that 
contains the ODs; the rest of the segment can be skipped. 

While reading the segments, we rebuild the relevant structures in memory. If we need 
to write index nodes during recovery because of insufficient buffer capacity, this is done 
to clean segments. When the log has been processed, we do a checkpoint, and when 
the checkpoint process is finished, the checkpoint blocks are updated. Idempotence is 
guaranteed because we do not modify any written data before updating the checkpoint 
block. If a system crashes during recovery, it will simply start recovery in the same way 
next time. 

Media failure in a log-only system can be handled by the use of mirroring (RAID 1) 
or RAID with parity blocks (for example RAID 4 or RAID 5). The use of mirroring 
will also improve read performance, because the read bandwidth is doubled. The write 
performance will stay the same. 

5 Conclusions and Further Work 

The log-only approach for TODBs has been shown to be promising with respect to 
performance [3]. However, although the log-only approach in its basic form is relatively 
straightforward, achieving performance and high concurrency is not straightforward. In 
this paper, we have described in detail how this can be achieved, resulting in a design that 
avoids some of the problems in previous, related designs. The results include support 
for steal/no-force buffer management, fuzzy checkpointing and fast commit. 

The results from this paper, together with our other work in this context, have shown 
that the log-only approach is feasible. Some ideas that we would like to explore further 
are the realization of valid-time or bitemporal TODBs using the log-only approach. We 
also plan to adapt the ideas to temporal object-relational database systems, which could 
make the results even more interesting. 







References 

1. J. Gray, P. McJones, M. Blasgen, B. Lindsay, R. Lorie, T. Price, F. Putzolu, and I. Traiger. The 
recovery manager of the System R database manager. ACM Computing Surveys, 13(2), 1981. 

2. K. Nørvåg. The Persistent Cache: Improving OID indexing in temporal object-oriented 
database systems. In Proceedings of the 25th VLDB Conference, 1999. 

3. K. Nørvåg. A comparative study of log-only and in-place update based temporal object database 
systems. In Proceedings of the Ninth International Conference on Information and Knowledge 
Management (CIKM'2000), 2000. 

4. K. Nørvåg. Design issues in transaction-time temporal object database systems. In Proceedings 
of ADBIS-DASFAA'2000, 2000. 

5. K. Nørvåg. The Vagabond temporal OID index: An index structure for OID indexing in temporal 
object database systems. In Proceedings of the 2000 International Database Engineering and 
Applications Symposium (IDEAS), 2000. 

6. K. Nørvåg. Vagabond: The Design and Analysis of a Temporal Object Database Management 
System. PhD thesis, Norwegian University of Science and Technology, 2000. 

7. M. Rosenblum and J. K. Ousterhout. The design and implementation of a log-structured file 
system. In Proceedings of the Thirteenth ACM Symposium on Operating System Principles, 
1991. 

8. V. Singhal, S. Kakkad, and P. Wilson. Texas: An efficient, portable persistent store. In 
Proceedings of the Fifth International Workshop on Persistent Object Systems, 1992. 

9. M. Stonebraker. The design of the POSTGRES storage system. In Proceedings of the 13th 
VLDB Conference, 1987. 




Operations for Conceptual Schema 
Manipulation: Definitions and Semantics 



Helle L. Christensen, Mads L. Haslund, Henrik N. Nielsen, and 
Nectaria Tryfona 

Department of Computer Science, Aalborg University 
Fredrik Bajersvej 7E, 9220 Aalborg Øst, DENMARK 
{arnaq,levi,moth,tryfona}@cs.auc.dk 



Abstract. This paper concerns the issue of conceptual schema integra- 
tion and manipulation, i.e., the process of manipulating highly abstract, 
semantically rich database schemas for the extraction of meaningful and 
unambiguous results. We propose the use of six fundamental manipu- 
lation operations, namely rename, select, project, union, set difference, 
and intersection. We give their definition and semantics using the Entity- 
Relationship modeling notation. We argue that in order to preserve the 
semantics of the ER-schemas before and after the operations are per- 
formed, these schemas need to be translated to mathematical formu- 
lations; then the manipulation operations can be applied. We use the 
ALCQI description logic notation which features a rich combination 
of constructors, powerful enough to express elements of ER-schemas. 
We show that the resulting knowledge bases, which encapsulate all the 
knowledge about the ER-schemas, need further structuring to accom- 
modate the semantics of the ER elements. Examples demonstrate the 
applicability and efficiency of the proposed approach. 



1 Introduction 

The need for integration and further manipulation in the area of databases and 
data warehouses has been well-documented in literature [9], [11], [12]. The pro- 
cess of integration relates mostly to source integration (i.e., when the sources are 
merged), schema integration (i.e., when conceptual and/or logical schemas are 
integrated), data integration (i.e., the process of comparing source data sets and 
creating a reconciled view), and information integration (i.e., when metadata 
or views are merged), while the process of manipulation includes issues such 
as pre-integration, schema comparison, schema union, schema intersection, and 
schema restructuring [10]. 

In this paper we deal with conceptual schema manipulation, i.e., the process 
of manipulating highly abstract, semantically rich schemas with set operations 
and operations comparable to the ones from relational algebra, for the extraction 
of meaningful results. These schemas may represent either different applications 
of the same domain, e.g., a patient and a drug trial schema for the corresponding 
databases of a pharmaceutical company, or just different excerpts of the same 
database. 



A. Caplinskas and J. Eder (Eds.): ADBIS 2001, LNCS 2151, pp. 225-238, 2001. 
(c) Springer-Verlag Berlin Heidelberg 2001 







The motivation behind this research effort is based on the argument that 
in order to build applications out of existing ones we need to raise the level of 
abstraction of operations on schemas, models and mappings [1]. Our goal is a 
methodology that supports conceptual schema manipulation in an unambiguous 
manner. 

More specifically, in this work, taking the Entity-Relationship (ER) model as the 
pilot conceptual model, we introduce the semantics and usage of six schema 
manipulation operations, namely rename, select, project, union, set difference, 
and intersection, for the manipulation of ER-schemas. The semantic definition 
of these operations is not enough to ensure their unmistakable application on 
conceptual schemas. A mathematically sound formulation, which captures the 
semantics of both the ER-schemas to be manipulated and the applied operations 
in a formal way, is needed. For this, we use Description Logic (DL) for 
the unambiguous representation of ER-schemas, as it is particularly well-suited 
for specifying data classes and relationships among classes and is equipped with 
both formal semantics and inference engines [5]. The DL we use for the 
representation of the ER-schemas, called ALCQI, features a rich combination of 
constructors powerful enough to express elements of ER-schemas. The expressions 
of ALCQI form Knowledge Bases (KB) about classes and relations, which 
ensure the preservation of the semantics for both ER-schemas and the applied 
operations. 

The rest of the paper is organized as follows. Section 2 describes representative 
related work on the issues of schema integration and manipulation as well 
as the use of DL in this area. Section 3 gives the semantics of the schema 
manipulation operations and shows their use on ER-schemas. Section 4 describes the 
way DL is used to preserve the semantics of the ER-schemas to be manipulated. 
The presented approach has been tested by manipulating the Data Management 
(DM) and the International Product Safety (IPS) database schemas of the 
Danish pharmaceutical company Novo Nordisk A/S; examples from this area 
demonstrate the applicability of our proposal. Section 5 concludes and draws 
future research directions. 



2 Related Work 

The approaches that have been developed in the area of integration differ with 
respect to the merging of elements of the used models. Moreover, the use of 
DL, to ensure preservation of semantics during manipulation, has been broadly 
adopted in the area of integration. 

According to [2], three types of description logic assertions can be used to 
determine relations between elements. Elements can only be merged if they fulfill 
the requirements set by the user, e.g., elements must be 85% equal to be merged. 
This process of creating a global view over a database might create conflicts, e.g., 
homonym conflicts where entities have the same name but are interpreted 
differently. In [4], a solution on how data integration conflicts can be resolved through 
suitable matching, reconciliation, and conversion operations is explained. The 
solution is based on a conceptual representation of the application domain which 
helps in the integration and reconciliation process. 

The research areas of integration and manipulation are explored in different 
directions. According to [3], two main approaches for integration exist: the 
procedural and the declarative. In the procedural approach, data is integrated in an 
ad-hoc manner with respect to some predefined requirements. Several projects 
follow this approach, such as TSIMMIS, Squirrel, and WHIPS, which are all 
explained briefly in [10]. The declarative approach, on the other hand, works 
towards modeling data in a suitable language to construct a unified representation. 
There are also tools that support this approach, such as Carnot, SIMS, and 
Information Manifold, which are also explained briefly in [10]. Another way to 
facilitate the process of integration is presented in [8], which describes a 
graphical tool called i*com. This tool facilitates the process of conceptual modeling 
and integration of databases, and uses a description logic inference engine to check 
the created schemas for possible inconsistencies. 

Our work presents a methodology to manipulate ER-schemas in an 
unambiguous way with the help of DL. We do not aim towards the implementation 
of a graphical tool, like i*com; we rather envision a mechanism which can be 
included on top of an already implemented system. 



3 The Schema Manipulation Operations 

In order to efficiently manipulate ER-schemas, we introduce the use and 
semantics of six fundamental operations. These operations, namely rename, 
select, project, union, set difference, and intersection, were chosen having as basis 
the fundamental operations from the relational algebra. They are constructed 
based on a set of guidelines that specifies choices made about integration of 
ER-schemas, for example, rules for merging schema elements, i.e., entities, 
relationships, attributes. (We chose to use the terms entity, and relationship, instead 
of entities, and relationships, respectively). 

The merging of schema elements during the manipulation process is based on 
the concepts of entity-equivalence, attribute-equality, and relationship-equality. 
More specifically, we allow the merging of two entities from two ER-schemas if 
they have the same name or if the user specifies them to be the same (entity- 
equivalence); we allow the merging of two attributes if they belong to the same 
entity and have the same names (attribute-equality). Two relationships are equal 
(relationship-equality) if they relate the same entities with the same cardinalities. 
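
The three merging criteria can be written down as simple predicates. The following Python sketch is an illustration under stated assumptions: the data model (`Entity`, `Relationship`), the representation of user-specified equivalences as a set of unordered pairs, and the attribute names in the usage example are hypothetical and not taken from the paper. Relationship-equality is checked here on entity names, so it assumes equivalent entities have already been given the same name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    schema: str
    name: str
    attributes: frozenset = frozenset()


@dataclass(frozen=True)
class Relationship:
    name: str
    ends: frozenset    # set of (entity name, cardinality) pairs


def entity_equivalent(e1, e2, specified=frozenset()):
    """Same name, or declared equivalent by the user."""
    pair = frozenset({(e1.schema, e1.name), (e2.schema, e2.name)})
    return e1.name == e2.name or pair in specified


def mergeable_attributes(e1, e2, specified=frozenset()):
    """Attributes merge only inside equivalent entities and only by equal name."""
    if not entity_equivalent(e1, e2, specified):
        return frozenset()
    return e1.attributes & e2.attributes


def relationship_equal(r1, r2):
    """Same entities related with the same cardinalities."""
    return r1.ends == r2.ends
```

For example, with the user-specified equivalence between Patient in IPS and Subject in DM, the two entities count as equivalent although their names differ, and only their identically named attributes are candidates for merging.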

Additionally, we make the assumption that the ER-schemas to be integrated 
and manipulated are logically connected, i.e., belong to the same domain and 
concern the same topics, e.g., sales, production, or patients. There exist cases 
which are difficult to address, such as synonym cases, e.g., entities may have 
synonym names and only some of their attributes are the same. We handle them by 
allowing the user to specify equivalent entities and ISA-relationships between 
entities in different ER-schemas. The notation used to specify equivalence between 
entities is $Entity_1(S_1) =_{spe} Entity_2(S_2)$, which denotes that $Entity_1$ from the 
ER-schema $S_1$ is specified to be the same as $Entity_2$ from the ER-schema $S_2$. 
The notation used to specify an ISA-relationship between entities in different 
ER-schemas is $Entity_1(S_1) \subseteq_{S} Entity_2(S_2)$, which denotes that $Entity_1$ from 
the ER-schema $S_1$ is a specialization of $Entity_2$ from the ER-schema $S_2$. 



3.1 An Application Example 

We tested the applicability of our approach on excerpts of two ER-schemas used 
in the pharmaceutical company Novo Nordisk A/S. The two schemas correspond 
to the Data Management (DM) database, which keeps track (among others) 
of products, trials, patients, and events, and the International Product Safety 
(IPS) database, which records serious events. The products are tested in trials 
involving test persons, which are referred to either as patients or as subjects. 
During a trial a patient can suffer adverse events. An event can be, e.g., a slight 
bruise, an allergic reaction, or death. A subset of these events is serious adverse 
events. A serious adverse event can be, e.g., anaphylactic shock or death. The 
IPS ER-schema regards data about the serious adverse events while the DM 
ER-schema regards data about all events. Figures 1 and 2 illustrate excerpts of 
the two schemas. 



Fig. 1. The IPS ER-schema 



It is obvious that the two database schemas have many common values in 
basic elements (e.g., patients, drugs, etc.) and the need for their integration is 
apparent. As described previously, the user can specify equivalent entities and 
ISA-relationships in the two ER-schemas. The following shows the specifications 
of equivalent entities and ISA-relationships for the two ER-schemas: 

– $Master(IPS) =_{spe} Trial(DM)$, $Patient(IPS) =_{spe} Subject(DM)$ 

– $SA\text{-}Event(IPS) \subseteq_{S} Event(DM)$ 



3.2 The Semantics of the Six Schema Manipulation Operations 

The schema manipulation operations operate on a universe of discourse, which 
we define to be the set $\Omega$ that consists of all ER-schemas. We further define the 
set $\mathcal{P}$ that consists of all predicates $P$. With these definitions, it is possible to 
define the domain and range of the schema manipulation operations. The domain 
and range for the six operations are summarized in Table 1a. A description of the 
semantics of the six operations follows. 

Fig. 2. The DM ER-schema 



Table 1. (a) Domain and range for the operations. (b) The rename operation 

The Rename Operation. The rename operation receives an ER-schema and 
a predicate as input and returns a new ER-schema; the predicate is a 
comma-separated list containing tuples, where each tuple contains the current name 
(fromName) of the entity and its new unique name (toName). The new 
ER-schema contains all the entities which were in the original ER-schema. The 
entities which were specified to be renamed are renamed according to the 
predicate. The semantics of the rename operation are summarized in Table 1b. 
An example of the rename operation is that of renaming the entity Subject 
in the DM ER-schema, which according to the specification in Sect. 3.1 can 
be renamed to Patient. The syntax for the example, where entities in the DM 
ER-schema are renamed, is: $rename_{((Subject,\,Patient))}(DM)$. 
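
As an illustration, the rename operation can be sketched in a few lines of Python. The schema representation (a dictionary with a set of entity names and a mapping from relationship names to the entities they relate) and the relationship names in the usage example are assumptions made for this sketch; attributes are left out.

```python
def rename(pairs, schema):
    """Return a new schema in which each (fromName, toName) pair is applied.

    `schema` is assumed to look like
    {"entities": {"A", "B"}, "relationships": {"r": ("A", "B")}}.
    """
    mapping = dict(pairs)
    new_names = list(mapping.values())
    untouched = schema["entities"] - set(mapping)
    # The new names have to be unique within the resulting schema.
    if len(set(new_names)) != len(new_names) or set(new_names) & untouched:
        raise ValueError("new entity names must be unique")
    entities = {mapping.get(e, e) for e in schema["entities"]}
    relationships = {
        name: tuple(mapping.get(e, e) for e in ends)
        for name, ends in schema["relationships"].items()
    }
    return {"entities": entities, "relationships": relationships}
```

The input schema is left unchanged, matching the description of the operation as returning a new ER-schema.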















The Select Operation. The select operation receives an ER-schema and a
predicate as input and returns a new ER-schema; the predicate consists of a list
of names, specifying the entities to be selected. The resulting ER-schema consists
of the entities specified in the predicate and any entities related to them.
The semantics of the operation are summarized in
Table 2a. An example of its use is illustrated in Fig. 3a.
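A minimal sketch of the select operation, under stated assumptions: the schema is a hypothetical dictionary with `entities` and `relationships` maps, two entities count as "related" when they take part in the same relationship, and the relationships that touch a selected entity are carried over. The example schema is invented.

```python
def select(schema, names):
    """Keep the named entities plus every entity related to one of them."""
    chosen = set(names) & set(schema["entities"])
    # Relationships that touch at least one selected entity.
    touched = {rel: ends for rel, ends in schema["relationships"].items()
               if chosen & set(ends)}
    keep = chosen.union(*map(set, touched.values()))
    return {
        "entities": {e: set(a) for e, a in schema["entities"].items() if e in keep},
        "relationships": touched,
    }


# Hypothetical schema; Visit is not related to Trial and is therefore dropped.
dm = {
    "entities": {"Patient": {"id"}, "Trial": {"id"}, "Visit": {"date"}},
    "relationships": {"participates": ("Patient", "Trial"),
                      "attends": ("Patient", "Visit")},
}
selected = select(dm, ["Trial"])
```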



The Project Operation. The project operation receives an ER-schema and a
predicate as input and returns a new ER-schema that consists of the entities
specified in the predicate and the relationships relating these entities. The
semantics of the project operation are summarized in Table 2b. An example of
the use of the project operation applied to the ER-schema of Fig. 1 is
illustrated in Fig. 3b.



Table 2. (a) The select operation. (b) The project operation

$Project_P(S_1) \to S_2$

1. Copy all entities specified in the predicate $P$ from $S_1$ to $S_2$.

2. Copy all relationships in $S_1$ which in $S_2$ are related to two or more
entities.

(b)

Fig. 3. (a) $select_{Trial}(DM)$. (b) $project_{Master,SA\text{-}Event}(IPS)$
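The two steps of the project operation can be sketched as follows. The dictionary-based schema representation and the example names are assumptions made for illustration; a copied relationship is restricted here to the entities that survive the projection, which is one possible reading of step 2.

```python
def project(schema, names):
    """Project a schema onto the named entities."""
    wanted = set(names)
    # Step 1: copy the entities specified in the predicate.
    kept = {e: set(a) for e, a in schema["entities"].items() if e in wanted}
    # Step 2: copy relationships that still relate two or more kept entities.
    relationships = {}
    for rel, ends in schema["relationships"].items():
        remaining = tuple(e for e in ends if e in kept)
        if len(remaining) >= 2:
            relationships[rel] = remaining
    return {"entities": kept, "relationships": relationships}


# Hypothetical excerpt of the IPS schema.
ips = {
    "entities": {"Master": {"id"}, "SA_Event": {"seriousness"}, "Product": {"prod_name"}},
    "relationships": {"reports": ("Master", "SA_Event"),
                      "concerns": ("SA_Event", "Product")},
}
projected = project(ips, ["Master", "SA_Event"])
```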



The Union Operation. The union operation receives two ER-schemas as input
and returns a new ER-schema containing the two original schemas merged
together. Equivalent entities are merged into one entity with the union of the
attributes of the original entities. It is common, when unifying schemas, to
have two entities, one in each schema, which are semantically related by an
ISA-relationship. Users can specify ISA-relationships between entities in the
two schemas, as mentioned in Sect. 3.1. This information is stored in a list,
called the ISA-list, which contains ISA-tuples, each specifying an
ISA-relationship between entities from two ER-schemas. When two ER-schemas
that have an ISA-relationship specified between their entities are merged, a
new relationship is added. This new relationship connects the two entities of
the ISA-relationship in the new ER-schema. The semantics of the union operation
are summarized in Table 3a, and an example of its use can be seen in Fig. 4.
The relationship isa* represents the relationship that is added in item 7a of
the semantics of Table 3a, while the relationship has* represents the has
relationship from the IPS schema, which has been renamed to be unique.

Fig. 4. $union(DM, IPS)$
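The steps listed for the union operation in Table 3a can be sketched as below. This is only an illustration under several assumptions: schemas are hypothetical dictionaries, two entities are treated as equivalent when they have the same name, two relationships are equal when they have the same name and relate the same entities, a clashing relationship is made unique by appending `*`, and the name given to an added ISA-relationship is invented.

```python
import copy


def union(s1, s2, isa_list=()):
    s3 = copy.deepcopy(s1)                                   # step 1
    s4 = copy.deepcopy(s2)                                   # step 2
    # Step 3: same name but not equal -> rename in s4 until the name is unique.
    for name in list(s4["relationships"]):
        if name in s3["relationships"] and s3["relationships"][name] != s4["relationships"][name]:
            fresh = name + "*"
            while fresh in s3["relationships"] or fresh in s4["relationships"]:
                fresh += "*"
            s4["relationships"][fresh] = s4["relationships"].pop(name)
    # Steps 4 and 5: merge attributes of equivalent entities, add the others.
    for entity, attrs in s4["entities"].items():
        s3["entities"].setdefault(entity, set()).update(attrs)
    # Step 6: add relationships that s3 does not contain yet.
    for name, ends in s4["relationships"].items():
        s3["relationships"].setdefault(name, ends)
    # Step 7: one new relationship per ISA-tuple (naming is hypothetical).
    for sub, sup in isa_list:
        s3["relationships"][f"isa*({sub},{sup})"] = (sub, sup)
    return s3


dm = {"entities": {"Trial": {"id", "status"}, "Patient": {"id"}},
      "relationships": {"has": ("Patient", "Trial")}}
ips = {"entities": {"Trial": {"id", "create_time"}, "Product": {"prod_name"},
                    "Person": {"name"}},
       "relationships": {"has": ("Product", "Trial")}}
merged = union(dm, ips, isa_list=[("Patient", "Person")])
```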



The Set Difference Operation. The set difference operation receives two
ER-schemas as input and returns a new ER-schema consisting of all the entities
in the ER-schema $S_1$ minus the entities that exist in the ER-schema $S_2$. This
implies that the operation simply removes the intersection of entities between the
two ER-schemas from the first ER-schema. The semantics of the set difference
operation are summarized in Table 3b and an example of the set difference
operation can be seen in Fig. 5a. Note that the set difference operation is not
commutative.
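A sketch of the set difference semantics of Table 3b, assuming a hypothetical dictionary representation of schemas and name equality as the test for equivalent entities. The example also shows that swapping the arguments changes the result.

```python
def set_difference(s1, s2):
    # Steps 1 and 2: copy s1 without the entities that also occur in s2.
    entities = {e: set(a) for e, a in s1["entities"].items()
                if e not in s2["entities"]}
    # Step 3: drop relationships left with zero or one entity.
    relationships = {}
    for name, ends in s1["relationships"].items():
        remaining = tuple(e for e in ends if e in entities)
        if len(remaining) >= 2:
            relationships[name] = remaining
    return {"entities": entities, "relationships": relationships}


# Hypothetical schemas sharing the entity Trial.
dm = {"entities": {"Trial": {"id"}, "Patient": {"id"}, "Visit": {"date"}},
      "relationships": {"has": ("Patient", "Trial"), "attends": ("Patient", "Visit")}}
ips = {"entities": {"Trial": {"id"}, "Product": {"prod_name"}},
       "relationships": {"tests": ("Product", "Trial")}}
left = set_difference(dm, ips)
right = set_difference(ips, dm)
```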



The Intersection Operation. The intersection operation receives two
ER-schemas as input and returns a new ER-schema consisting of the entities and
relationships which exist in both of the original ER-schemas. The semantics of
the intersection operation are summarized in Table 3c and an example of use
of the intersection operation can be seen in Fig. 5b. Note that the intersection
operation takes the union of the attributes of each common entity and adds them to
the entity appearing in the new ER-schema.
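The intersection semantics of Table 3c can be sketched in the same style. As before, the dictionary representation, name-based equivalence of entities, and the example data are assumptions for illustration only.

```python
def intersection(s1, s2):
    # Step 1: every entity present in both schemas, with the union of attributes.
    entities = {e: set(s1["entities"][e]) | set(s2["entities"][e])
                for e in s1["entities"] if e in s2["entities"]}
    # Step 2: relationships that are equal (same name, same entities) in both.
    relationships = {name: ends for name, ends in s1["relationships"].items()
                     if s2["relationships"].get(name) == ends}
    return {"entities": entities, "relationships": relationships}


# Hypothetical schemas with two common entities and one common relationship.
ips = {"entities": {"Trial": {"id", "create_time"}, "Site": {"name"},
                    "Product": {"prod_name"}},
       "relationships": {"located_at": ("Trial", "Site"),
                         "tests": ("Product", "Trial")}}
dm = {"entities": {"Trial": {"id", "status"}, "Site": {"name"}},
      "relationships": {"located_at": ("Trial", "Site")}}
common = intersection(ips, dm)
```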




Fig. 5. (a) $set\ difference(DM, IPS)$. (b) $intersection(IPS, DM)$



Table 3. (a) The union operation. (b) The set difference operation. (c) The intersection
operation

$Union(S_1, S_2) \to S_3$

1. Copy $S_1$ to $S_3$.
2. Copy $S_2$ to $S_4$.
3. Find all relationships in $S_3$ and $S_4$ that have the same name, but are
   not equal.
   a) Change the names of the relationships found in $S_4$, so they are unique
      in both $S_3$ and $S_4$.
4. Find all entities from $S_3$ and $S_4$ that are equivalent.
   a) Add unique attributes from entities in $S_4$ to entities in $S_3$.
5. Add unique entities from $S_4$ to $S_3$.
6. Add unique relationships from $S_4$ to $S_3$.
7. For each tuple in the ISA-list do:
   a) Add a new relationship to $S_3$.

(a)

$Set\ Difference(S_1, S_2) \to S_3$

1. Copy $S_1$ to $S_3$.
2. Remove all entities from $S_3$ which have equivalent entities in $S_2$.
3. Remove all relationships which belong to zero or one entities, along
   with any references to these relationships, from $S_3$.

(b)

$Intersection(S_1, S_2) \to S_3$

1. Find all equivalent entities in $S_1$ and $S_2$.
   a) Add the union of the two entities to $S_3$.
2. Add all equal relationships from $S_1$ and $S_2$ to $S_3$.

(c)



4 Description Logic and the Manipulation Operations 

In order to ensure that the application of the manipulation operations to schemas
returns unambiguous results, we need to define the ER-schemas in a clear, formal
way, with unique interpretations. For this, we adopt the description
logic $\mathcal{ALCQI}$ and the translation of ER-schemas to KBs described in
[5], although the notation is inspired by [7]. The reason for choosing $\mathcal{ALCQI}$ is
that it provides a broad formal framework, which is useful for the preservation
of semantics and the expression of the elements of ER-schemas. We first give a
brief introduction to $\mathcal{ALCQI}$ (Sect. 4.1) and then describe the mapping and
translation of ER-schemas to $\mathcal{ALCQI}$ KBs (Sects. 4.2 and 4.3). For further
preservation of semantics, the KBs need to be structured to accommodate the
elements of the ER-schemas (Sect. 4.4).

4.1 The Description Logic $\mathcal{ALCQI}$

The basic types of $\mathcal{ALCQI}$ are concepts and roles. Concepts and roles belong
to a domain $\Delta$, which consists of a set of instances. The domain $\Delta$ could, e.g.,
be Citizen of Denmark, where the instances are people living in Denmark.

A concept is a subset of the domain $\Delta$ and it can be either atomic or
composed. Composed concepts are expressions built by applying concept
constructors to a set of atomic concepts. An atomic concept is a class in the domain $\Delta$
and it can be chosen to be any of the classes in the domain, but some of the
classes in the domain are more useful as atomic concepts than others. A concept
in the domain Citizen of Denmark could be male, which consists of all male
people in the domain, whereas a composed concept could be $female = \neg male$.

A role is a binary relation between concepts of the domain $\Delta$ and is, like a
concept, either atomic or composed. Composed roles are expressions made by
applying role constructors to a set of atomic roles. A role in the domain Citizen
of Denmark could be parent_of, which relates all parents in the domain Citizen of
Denmark with their children. A composed role could be $child\_of = parent\_of^{-}$,
which relates all children with their parents.

The syntax for the constructors (both concept and role constructors) is shown
in the syntax part of Table 4, where $C$ and $C'$ are concepts and $R$ is a role.
The semantics of $\mathcal{ALCQI}$ are given as sets on the domain $\Delta$. A concept $C$ is
interpreted as a set of instances of the domain $\Delta$, while a role is interpreted as a
set of pairs of instances of the domain $\Delta$. Formally, an interpretation
$\mathcal{I} = (\Delta^{\mathcal{I}}, \cdot^{\mathcal{I}})$ consists of a nonempty set $\Delta^{\mathcal{I}}$ and a function
$\cdot^{\mathcal{I}}$ that maps every concept to a subset of $\Delta^{\mathcal{I}}$, every role to a subset of
$\Delta^{\mathcal{I}} \times \Delta^{\mathcal{I}}$ and every individual to an instance of $\Delta^{\mathcal{I}}$. The set
$\Delta^{\mathcal{I}}$ is called the domain, while $\cdot^{\mathcal{I}}$ is called the interpretation
function. The semantics of all the constructors are shown in the semantics part
of Table 4.

The constructors in Table 4 are used to form expressions in $\mathcal{ALCQI}$. A set
of expressions in $\mathcal{ALCQI}$ forms a KB. A KB encapsulates knowledge about a
specific domain, expressed in terms of concepts and roles. An $\mathcal{ALCQI}$ KB
consists of a finite set of assertions. There are two types of assertions in
$\mathcal{ALCQI}$: inclusion and equality assertions. An assertion relates an atomic
concept $A$ to a concept $C$, where $C$ can be atomic or composed, and each
assertion represents knowledge about the domain. The syntax of the inclusion
assertion is $A \sqsubseteq C$, while its semantics is
$(A \sqsubseteq C)^{\mathcal{I}} = (A^{\mathcal{I}} \subseteq C^{\mathcal{I}})$. The syntax of the
equality assertion is $A \doteq C$, while its semantics is
$(A \doteq C)^{\mathcal{I}} = (A^{\mathcal{I}} = C^{\mathcal{I}})$.







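Table 4. The syntax and semantics rules of $\mathcal{ALCQI}$

Syntax                          Semantics

$C, C' \to A \mid$              $A^{\mathcal{I}} \subseteq \Delta^{\mathcal{I}}$
$\neg C \mid$                   $(\neg C)^{\mathcal{I}} = \Delta^{\mathcal{I}} \setminus C^{\mathcal{I}}$
$C \sqcap C' \mid$              $(C \sqcap C')^{\mathcal{I}} = C^{\mathcal{I}} \cap C'^{\mathcal{I}}$
$C \sqcup C' \mid$              $(C \sqcup C')^{\mathcal{I}} = C^{\mathcal{I}} \cup C'^{\mathcal{I}}$
$\forall R.C \mid$              $(\forall R.C)^{\mathcal{I}} = \{o \in \Delta^{\mathcal{I}} \mid \forall o'.\ \langle o, o' \rangle \in R^{\mathcal{I}} \to o' \in C^{\mathcal{I}}\}$
$\exists R.C \mid$              $(\exists R.C)^{\mathcal{I}} = \{o \in \Delta^{\mathcal{I}} \mid \exists o'.\ \langle o, o' \rangle \in R^{\mathcal{I}} \land o' \in C^{\mathcal{I}}\}$
$\exists^{\geq n} R.C \mid$     $(\exists^{\geq n} R.C)^{\mathcal{I}} = \{o \in \Delta^{\mathcal{I}} \mid \#\{o' \mid \langle o, o' \rangle \in R^{\mathcal{I}} \land o' \in C^{\mathcal{I}}\} \geq n\}$
$\exists^{\leq n} R.C$          $(\exists^{\leq n} R.C)^{\mathcal{I}} = \{o \in \Delta^{\mathcal{I}} \mid \#\{o' \mid \langle o, o' \rangle \in R^{\mathcal{I}} \land o' \in C^{\mathcal{I}}\} \leq n\}$
$R \to P \mid$                  $P^{\mathcal{I}} \subseteq \Delta^{\mathcal{I}} \times \Delta^{\mathcal{I}}$
$P^{-}$                         $(P^{-})^{\mathcal{I}} = \{\langle o, o' \rangle \in \Delta^{\mathcal{I}} \times \Delta^{\mathcal{I}} \mid \langle o', o \rangle \in P^{\mathcal{I}}\}$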
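The set semantics of Table 4 can be made concrete on a small finite domain. The following sketch is not taken from the paper: it assumes a hypothetical encoding in which atomic concepts and roles are strings, composed concepts are nested tuples such as `("not", C)` or `("atleast", n, R, C)`, and an inverse role is written `("inv", P)`. The tiny domain with three people is invented.

```python
def role_pairs(role, roles):
    """Extension of an atomic role P (a string) or an inverse role ("inv", P)."""
    if isinstance(role, str):
        return set(roles[role])
    _, atomic = role
    return {(b, a) for (a, b) in roles[atomic]}


def extension(concept, domain, concepts, roles):
    """Set of instances denoted by a concept expression over a finite domain."""
    if isinstance(concept, str):                      # atomic concept A
        return set(concepts[concept])
    op = concept[0]
    if op == "not":
        return domain - extension(concept[1], domain, concepts, roles)
    if op == "and":
        return (extension(concept[1], domain, concepts, roles)
                & extension(concept[2], domain, concepts, roles))
    if op == "or":
        return (extension(concept[1], domain, concepts, roles)
                | extension(concept[2], domain, concepts, roles))
    if op in ("all", "some"):
        _, role, filler = concept
        n = None
    elif op in ("atleast", "atmost"):
        _, n, role, filler = concept
    else:
        raise ValueError(f"unknown constructor: {op}")
    pairs = role_pairs(role, roles)
    target = extension(filler, domain, concepts, roles)
    result = set()
    for o in domain:
        successors = {b for (a, b) in pairs if a == o}
        matching = successors & target
        if ((op == "all" and successors <= target)
                or (op == "some" and matching)
                or (op == "atleast" and len(matching) >= n)
                or (op == "atmost" and len(matching) <= n)):
            result.add(o)
    return result


def inclusion_holds(a, c, domain, concepts, roles):
    """An inclusion assertion A ⊑ C holds iff ext(A) is a subset of ext(C)."""
    return extension(a, domain, concepts, roles) <= extension(c, domain, concepts, roles)


domain = {"anna", "bob", "carl"}
concepts = {"male": {"bob", "carl"}}
roles = {"parent_of": {("anna", "bob"), ("bob", "carl")}}

female = extension(("not", "male"), domain, concepts, roles)
child_of = role_pairs(("inv", "parent_of"), roles)
has_female_parent = extension(
    ("atleast", 1, ("inv", "parent_of"), ("not", "male")), domain, concepts, roles)
```

In this toy interpretation `female` is the complement of `male`, and `child_of` is obtained purely by inverting `parent_of`, mirroring the two composed examples given in Sect. 4.1.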



4.2 Mapping between ER and $\mathcal{ALCQI}$

The ER model cannot be mapped directly to $\mathcal{ALCQI}$, as it consists of other types
of basic elements than $\mathcal{ALCQI}$. The ER model consists of entities, relationships,
attributes, and ER-roles, while $\mathcal{ALCQI}$ consists of concepts and roles [5]. The
elements of the ER model should be modeled as DL elements to create an
$\mathcal{ALCQI}$ description logic knowledge base [5]. A KB is a formal representation
that encapsulates all knowledge about a given ER-schema and enables the
identification of all the original elements of the ER-schema. The basic elements of
the ER model are modeled as elements of $\mathcal{ALCQI}$ (Table 5). The function $cmin_S$
maps an ER-role to a non-negative integer, whereas the function $cmax_S$ maps
an ER-role to a non-negative integer or to infinity.



Table 5. Mapping between the ER model components and $\mathcal{ALCQI}$ components

Basic element in ER model   Modeled as   Notes

Entity                      Concept      -
Attribute                   Role         Relates an entity to the domain of the attribute.
Relationship                Concept      Relationships must have unique names within
                                         the ER-schema, due to constraints in $\mathcal{ALCQI}$.
ER-role                     Role         Relates a relationship and an entity.
Cardinality                 Role         Uses the functions $cmin_S$ and $cmax_S$ to model
                                         lower and upper bounds on the role.









4.3 Translation from ER to $\mathcal{ALCQI}$

The translation function $\phi$ translates an ER-schema into an $\mathcal{ALCQI}$ KB. An
$\mathcal{ALCQI}$ KB $\phi(S)$, representing the ER-schema $S$, consists of an atomic concept
$\phi(A)$ for each entity, relationship or domain symbol $A$ in $S$, and an atomic role
$\phi(R)$ for each attribute symbol or ER-role $R$ in $S$.

In [5] the translation from an ER-schema to an equivalent $\mathcal{ALCQI}$ KB is
done through five translation rules. Each rule translates a different part of the
ER-schema into description logic expressions. Only three out of the five rules
from [5] are useful in our context. These are:

1. For each entity $E$ with attributes $A_1, \ldots, A_h$ with the domains $D_1, \ldots, D_h$
respectively, insert the assertion:

$\phi(E) \sqsubseteq \forall\phi(A_1).\phi(D_1) \sqcap \cdots \sqcap \forall\phi(A_h).\phi(D_h) \sqcap \exists^{=1}\phi(A_1) \sqcap \cdots \sqcap \exists^{=1}\phi(A_h)$

2. For each relationship $R$ of arity $k$ between entities $E_1, \ldots, E_k$, to which $R$ is
connected by means of ER-roles $U_1, \ldots, U_k$ respectively, insert the assertions:

$\phi(R) \sqsubseteq \forall\phi(U_1).\phi(E_1) \sqcap \cdots \sqcap \forall\phi(U_k).\phi(E_k) \sqcap \exists^{=1}\phi(U_1) \sqcap \cdots \sqcap \exists^{=1}\phi(U_k)$
$\phi(E_i) \sqsubseteq \forall(\phi(U_i))^{-}.\phi(R)$, for $i \in \{1, \ldots, k\}$

3. For each ER-role $U$ associated with relationship $R$ and entity $E$,

- if $m = cmin_S(U) \neq 0$, then insert the assertion: $\phi(E) \sqsubseteq \exists^{\geq m}(\phi(U))^{-}$

- if $m = cmax_S(U) \neq \infty$, then insert the assertion: $\phi(E) \sqsubseteq \exists^{\leq m}(\phi(U))^{-}$
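The three translation rules can be sketched as a function that emits assertions as plain data. The encoding is an assumption made for this illustration and is not the representation used in [5] or [6]: an assertion is a pair of a left-hand concept name and a list of conjuncts, where `("all", R, C)` stands for $\forall R.C$, `("exactly", 1, R)` for $\exists^{=1} R$, `("atleast", m, R)` and `("atmost", m, R)` for the cardinality bounds, and `("inv", U)` for an inverse role. The cardinality of `hp` in the example is hypothetical.

```python
import math


def translate(entities, relationships, cmin=None, cmax=None):
    """entities: {E: {attribute: domain}}; relationships: {R: {er_role: entity}}."""
    cmin = cmin or {}
    cmax = cmax or {}
    kb = []
    # Rule 1: one assertion per entity that has attributes.
    for entity, attrs in entities.items():
        if attrs:
            conjuncts = [("all", a, d) for a, d in attrs.items()]
            conjuncts += [("exactly", 1, a) for a in attrs]
            kb.append((entity, conjuncts))
    # Rule 2: one assertion per relationship, one per participating entity.
    for rel, er_roles in relationships.items():
        conjuncts = [("all", u, e) for u, e in er_roles.items()]
        conjuncts += [("exactly", 1, u) for u in er_roles]
        kb.append((rel, conjuncts))
        for u, e in er_roles.items():
            kb.append((e, [("all", ("inv", u), rel)]))
    # Rule 3: cardinality bounds, skipped when cmin is 0 or cmax is infinite.
    for rel, er_roles in relationships.items():
        for u, e in er_roles.items():
            low = cmin.get(u, 0)
            high = cmax.get(u, math.inf)
            if low != 0:
                kb.append((e, [("atleast", low, ("inv", u))]))
            if high != math.inf:
                kb.append((e, [("atmost", high, ("inv", u))]))
    return kb


entities = {"Product": {"product_name": "D1"}, "Trial": {"id": "D2"}}
relationships = {"Has": {"hp": "Product", "ht": "Trial"}}
kb = translate(entities, relationships, cmin={"hp": 1})
```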



4.4 Structuring the Knowledge Bases 

Even though a KB encapsulates all knowledge about a given ER-schema and
enables the identification of all the original elements, its structure is unfit for
locating specific elements. The KB is therefore further structured into a Structured
Knowledge Base (SKB). An SKB contains two lists, one that contains knowledge
about entities in the KB, called the Entity Concept list (EC-list), and another
that contains knowledge about relationships in the KB, called the Relationship
Concept list (RC-list).

The EC-list has an entry for each entity in the ER-schema. Each entry in
the EC-list consists of the name of the entity that it represents and two lists.
One list, the relationship-expression list (RE-list), contains information about
the relationships that the entity is related to and the other list, the
domain-expression list (DE-list), contains information about the attributes of the entity.

The RC-list has an entry for each relationship in the ER-schema. Each entry
in the RC-list consists of the name of the relationship that it represents and
one list. The list of the relationship entry, called the relationship-expression list,
contains information about the entities that the relationship is related to. The
SKB for an ER-schema $S$ has the following structure:

Structured knowledge base $\phi(S)$:

$EC = (EC_1, EC_2, \ldots, EC_n)$, where $EC_i = (RE_{EC_i}, DE_{EC_i})$

$RC = (RC_1, RC_2, \ldots, RC_m)$, where $RC_i = (RE_{RC_i})$







A KB is structured into an SKB by dividing all the assertions in the KB into
explicit assertions. The assertions of the KB are made explicit by applying the
following transformation, which can be proven by induction:

$A \sqsubseteq C_1 \sqcap \cdots \sqcap C_n \iff (A \sqsubseteq C_1) \sqcap \cdots \sqcap (A \sqsubseteq C_n)$

Each of these explicit assertions is then placed in the SKB according to the
following three rules:

- If the concept of the constructor is represented by a domain symbol $D_j$,
which denotes the union of all attribute domains, then the assertion belongs
to the domain-expression list of the entity specified in the assertion.

- If the role of the constructor is represented by an inverse role, then the
assertion belongs to the relationship-expression list of the entity specified in
the assertion.

- If neither of the two rules above applies, then the assertion belongs to the
relationship-expression list of the relationship specified in the assertion.
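The splitting into explicit assertions and the three grouping rules can be sketched as follows. The encoding of assertions is a hypothetical one chosen for this illustration (a left-hand name plus a list of conjuncts such as `("all", role, concept)` or `("exactly", 1, role)`), and it is not the algorithm of [6]. One further assumption: a conjunct without a concept part, such as $\exists^{=1}\phi(id)$, is counted as a domain expression when its role is a known attribute. The sample KB and its lower bound on `hp` are illustrative.

```python
def structure(kb, domains, attributes):
    """Group the explicit assertions of a KB into EC- and RC-list entries."""
    skb = {"EC": {}, "RC": {}}
    for lhs, conjuncts in kb:
        for c in conjuncts:          # A ⊑ C1 ⊓ ... ⊓ Cn is split into A ⊑ Ci
            if c[0] == "all":
                role, filler = c[1], c[2]
            else:
                role, filler = c[2], None
            if filler in domains or role in attributes:
                # Rule 1: domain expression of an entity.
                skb["EC"].setdefault(lhs, {"RE": [], "DE": []})["DE"].append(c)
            elif isinstance(role, tuple) and role[0] == "inv":
                # Rule 2: inverse role, relationship expression of an entity.
                skb["EC"].setdefault(lhs, {"RE": [], "DE": []})["RE"].append(c)
            else:
                # Rule 3: everything else describes a relationship.
                skb["RC"].setdefault(lhs, {"RE": []})["RE"].append(c)
    return skb


kb = [
    ("Product", [("all", "product_name", "D1"), ("exactly", 1, "product_name"),
                 ("all", ("inv", "hp"), "Has"), ("atleast", 1, ("inv", "hp"))]),
    ("Trial", [("all", "id", "D2"), ("exactly", 1, "id"),
               ("all", ("inv", "ht"), "Has")]),
    ("Has", [("all", "hp", "Product"), ("exactly", 1, "hp"),
             ("all", "ht", "Trial"), ("exactly", 1, "ht")]),
]
skb = structure(kb, domains={"D1", "D2"}, attributes={"product_name", "id"})
```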

An ER-schema and its equivalent KB are shown in Table 6. The ER-schema
is an excerpt of Fig. 2, with the addition of the ER-roles ht and hp. This KB
can then be structured into the SKB shown in Table 7. When an ER-schema
has been translated into an $\mathcal{ALCQI}$ KB, the schema manipulation operations
can be applied. Each of the operations is realized as an algorithm
that takes the appropriate KBs as input and returns a KB. Based on the
semantics of the schema manipulation operations and the structure of the SKB,
it is a simple task to specify algorithms that implement the schema operations.
The algorithms that have been created are presented in [6].



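Table 6. An excerpt of the IPS ER-schema with the addition of ER-roles and its
equivalent KB

$\phi(Product) \sqsubseteq \forall\phi(product\_name).D_1 \sqcap \exists^{=1}\phi(product\_name) \sqcap \forall(\phi(hp))^{-}.\phi(Has) \sqcap \exists^{\geq 1}(\phi(hp))^{-}$
$\phi(Trial) \sqsubseteq \forall\phi(id).D_2 \sqcap \exists^{=1}\phi(id) \sqcap \forall(\phi(ht))^{-}.\phi(Has)$
$\phi(Has) \sqsubseteq \forall\phi(hp).\phi(Product) \sqcap \exists^{=1}\phi(hp) \sqcap \forall\phi(ht).\phi(Trial) \sqcap \exists^{=1}\phi(ht)$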



5 Conclusions and Future Work 

Table 7. The SKB for the ER-schema shown in Table 6

Entity Concept: Product
  RE-list: ($\forall(\phi(hp))^{-}.\phi(Has) \sqcap \exists^{\geq 1}(\phi(hp))^{-}$)
  DE-list: ($\forall\phi(product\_name).D_1 \sqcap \exists^{=1}\phi(product\_name)$)

Entity Concept: Trial
  RE-list: ($\forall(\phi(ht))^{-}.\phi(Has)$)
  DE-list: ($\forall\phi(id).D_2 \sqcap \exists^{=1}\phi(id)$)

Relationship Concept: Has
  RE-list: ($\forall\phi(hp).\phi(Product) \sqcap \exists^{=1}\phi(hp)$),
           ($\forall\phi(ht).\phi(Trial) \sqcap \exists^{=1}\phi(ht)$)

We have presented the definitions and semantics of six fundamental operations
for conceptual schema manipulation, namely rename, select, project, union,
set difference, and intersection, as one step towards a methodology for
semi-automatic schema management. We have shown the way these operations
manipulate ER-schemas. We use the ER model as a prototype environment; we are
working towards showing that object-oriented [13] and other semantic models
can be used as a framework, once their elements and properties have been defined
formally.

In order to make sure that we preserve both the semantics of the operations
and the manipulated schemas, we adopt the use of $\mathcal{ALCQI}$. More specifically,
we translate ER-schemas into KBs that encapsulate all their semantics and
use the KBs as input to the algorithms that implement the
six operations. The output is a KB that represents a correct and unambiguous
schema.

We have applied our approach to the real environment of the pharmaceutical 
databases of Novo Nordisk A/S, experiencing useful results, with no inconsis- 
tencies. 

We are currently working towards the implementation of a mechanism that
supports the proposed approach on top of the Designer and Repository tool
of Oracle. We are also working on allowing semantically richer ER-schemas,
e.g., by diagrammatically allowing ISA-relationships and aggregated entities.
Furthermore, we are considering the extension of the operations towards a more
general use; for example, the rename, select, and project operations can also be
applied to relationships if accompanied by the appropriate constraints.

Acknowledgments. The authors thank Novo Nordisk A/S for supporting this 
research and providing the case study environment, and Luca Aceto for his help 
with DL. 



References 

1. Philip A. Bernstein. Panel: Is Generic Metadata Management Feasible? In Proc.
of the 26th Intl. Conference on Very Large Data Bases, pages 660-663, 2000.

2. A. Bonifati, L. Palopoli, D. Sacca, and D. Ursino. Discovering Description Logic
Assertions from Database Schemes. In Proc. of the Intl. Workshop on Description
Logics - DL-97, pages 144-148, 1997.

3. Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, Daniele Nardi, and
Riccardo Rosati. Information Integration: Conceptual Modeling and Reasoning
Support. In Proc. of the 3rd IFCIS Intl. Conference on Cooperative Information
Systems, New York, USA, pages 280-291. IEEE-CS Press, August 20-22, 1998.

4. Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, Daniele Nardi, and
Riccardo Rosati. A Principled Approach to Data Integration and Reconciliation
in Data Warehousing. In Proc. of the Intl. Workshop on Design and Management
of Data Warehouses (DMDW'99), volume 19, June 1999.

5. Diego Calvanese, Maurizio Lenzerini, and Daniele Nardi. Logics for Databases and
Information Systems, chapter 8, pages 229-263. Kluwer Academic Publishers, 1998.

6. Helle L. Christensen, Mads L. Haslund, and Henrik N. Nielsen. Operations for
Schema Integration, Definitions & Semantics. Technical report, Aalborg University,
2001.

7. Francesco M. Donini, Maurizio Lenzerini, Daniele Nardi, and Andrea Schaerf.
Reasoning in Description Logics. In Foundation of Knowledge Representation, pages
191-236. CSLI-Publications, 1996.

8. Enrico Franconi and Gary Ng. The i.com Tool for Intelligent Conceptual
Modelling. In 7th Intl. Workshop on Knowledge Representation meets Databases
(KRDB'00), August 2000.

9. William H. Inmon. Building the Data Warehouse. John Wiley & Sons, 2nd
edition, March 1996.

10. Matthias Jarke, Maurizio Lenzerini, Yannis Vassiliou, and Panos Vassiliadis.
Fundamentals of Data Warehouses. Springer Verlag Berlin Heidelberg, 2000.

11. Peter McBrien and Alexandra Poulovassilis. A Formal Framework for ER Schema
Transformation. In Conceptual Modeling - ER '97, 16th Intl. Conference on
Conceptual Modeling, Los Angeles, California, USA, November 3-5, 1997, Proc.,
volume 1331 of Lecture Notes in Computer Science, pages 408-421. Springer, 1997.

12. Christine Parent and Stefano Spaccapietra. Issues and Approaches of Database
Integration. CACM, 41(5):166-178, 1998.

13. James E. Rumbaugh. OMT: The Object Model. Journal of Object-Oriented
Programming, 7(9):21-27, February 1995.




Object-Oriented Database as a Dynamic System 
with Implicit State 



Kazem Lellahi 1 and Alexandre Zamulin 2 * 

1 LIPN, UPRESA 7030 C.N.R.S 
Universite de Paris 13, Institut Galilee 
99, Av. J.B. Clement, 93430 Villetaneuse France 
kl@lipn.univ-paris13.fr, fax +33 (0)1 4826 0712
2 A.P. Ershov Institute of Informatics Systems 
Siberian Division of Russian Academy of Sciences 
Novosibirsk 630090, Russia 
zam@iis.nsk.su, fax: +7 3832 323494 



Abstract. A formalization of object-oriented database concepts in the
context of algebraic specifications with implicit state is proposed. An
object database schema is represented as a dynamic system and an object
database instance as a state algebra. The paper also provides a
formalization of binding modes and a rigorous treatment of null values.

Keywords: object modeling, object-oriented database, dynamic system, 
implicit state, state update. 



1 Introduction 

Object-oriented systems deal with collections of objects. An object has a state 
and a behavior. The behavior of an object serves to access or update its state. 
Objects are organized in an inheritance hierarchy. Another important aspect of 
object modeling is overloading , which is the possibility of giving the same name 
to several attributes or methods. 

The need for a formal definition of these and other pertinent object concepts is
widely recognized. Papers and books devoted to this problem include database
approaches [2,15,5,13,14], conventional algebraic approaches [2,9], and
model-based approaches [6,22]. The database approaches have resulted in the proposal
of the ODMG object model and the object definition language ODL and object
query language OQL based upon it [4,5]. However, the model is not formalized.

A formal object model FOR (Functional-Object-Relational) has been
introduced in [17] and further developed in [24,16]. The main characteristic of the
model is its similarity to the relational model and relational algebra, that is, a
clear separation between database schema, database instance and query.
However, that model does not deal with an important aspect of object databases,
namely updating methods. This paper focuses on the updating aspects.

* The work of the second author is supported in part by Russian Foundation for Basic 
Research under Grant 01-01-00787. 



A. Caplinskas and J. Eder (Eds.): ADBIS 2001, LNCS 2151, pp. 239—252, 2001. 
(c) Springer- Verlag Berlin Heidelberg 2001 







Another interesting approach is adopted in collection types [3,8,7]. The main
idea of this approach is to enrich pure functional programming techniques with
facilities for manipulating collections of data. It does not, however, address the
problem of state transformation and does not introduce any concept of object.
Moreover, since only total functions are allowed in this approach, it does not
clearly treat null values, which are quite typical in practical database applications.

Thus, no formal object-oriented data model with the same authority as the
relational model and no object algebra as elegant as the relational algebra has
yet emerged. In all likelihood, this is explained by the fact that all previous
formalization attempts were mainly based on conventional algebraic
specification techniques. However, as noted in [21], the algebraic framework so far
has been inadequate for describing the dynamic properties of objects and their
state transformations, as well as more complex notions typical of the object-oriented
paradigm, such as object identity. A way of modeling these features is proposed
in some approaches based on the notion of implicit state [12,10,11,25]. This
technique seems to be very convenient for representing updating methods. Partial
functions used in conventional algebraic specification approaches (for example,
in the specification language CASL [19]) seem to be a convenient way of dealing
with null values. We are going to use these techniques to solve both problems.

In general, the present paper proposes a new formalism for describing the 
main aspects of objects represented in the ODMG object model [4]. We struc- 
ture objects into classes within a hierarchy with a formal definition of schema. 
However, our definition of schema takes into account all static and all dynamic 
aspects of objects as well as the concepts of inheritance, object identity and 
overloading. Attributes and methods are treated semantically in a uniform way 
as partial functions. This allows us to give a uniform rigorous formalization of 
null values, undefinedness, and binding modes. This also allows us to define a 
database schema in a way similar to an algebraic specification of data types 
and to view a database instance as a state algebra of that specification. Basic 
operations updating the state are formally defined by what we call an applicable 
update set. Such a set consists of a set of parallel actions operating on a valid 
state and providing a valid state. Each action creates or destroys an object or 
updates an attribute. In a database context an applicable update set can be 
regarded as a transaction. 

The rest of the paper is organized as follows. In Sect. 2 we present our formal 
object model. Section 3 is the core of the paper. We define there a database state 
as a state algebra and we present a new formalism for describing state updates. 
In Sect. 4 we define elementary expressions of the object model, and we draw 
some conclusions and indicate some further work in Sect. 5. 

2 The Object Data Model 

The Type System: The model is based on a type system with the following
grammar:

T ::= BASE | CLASS | rec IDENT : T, ..., IDENT : T end | set T (1)

where BASE and CLASS are two disjoint non-empty sets of names representing
basic types and class names, respectively, IDENT is a set of identifiers, and rec
and set are two type constructors serving for the creation of record types and
set types, respectively. Elements of $T$ are type expressions (or types). We say
that the type expression $t$ contains the class name $c$ if one of the following holds:

- $t$ is $c$,

- $t$ is rec $p_1 : t_1, \ldots, p_n : t_n$ end and at least one of the $t_i$ contains $c$,

- $t$ is set $t_1$ and $t_1$ contains $c$.
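The grammar and the "contains" relation translate almost directly into code. The sketch below is an illustration only; the class names `Base`, `Class`, `Rec` and `Set` and the example type are assumptions and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Base:
    name: str          # a name from BASE


@dataclass(frozen=True)
class Class:
    name: str          # a name from CLASS


@dataclass(frozen=True)
class Rec:
    fields: tuple      # tuple of (identifier, type) pairs


@dataclass(frozen=True)
class Set:
    elem: object       # the element type


def contains(t, c):
    """True if the type expression t contains the class name c."""
    if isinstance(t, Class):
        return t.name == c
    if isinstance(t, Rec):
        return any(contains(field_type, c) for _, field_type in t.fields)
    if isinstance(t, Set):
        return contains(t.elem, c)
    return False       # basic types contain no class name


# set rec owner : Person, age : int end
owners = Set(Rec((("owner", Class("Person")), ("age", Base("int")))))
```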

In the sequel, $T^*$ stands for the set of all sequences of elements of $T$ and $T^u$
for $T \cup \{void\}$, where void is a special type whose semantics will be explained
later. It is assumed that BASE and CLASS are endowed with a syntactic equality
'='. This equality is extended to $T$, $T^*$ and $T^u$ in an obvious way. An element
$r$ of $T^*$ is either the empty sequence or has the form $t_1 \cdots t_n$, where $n > 0$ and
$t_i \in T$ for all $i$ ($1 \le i \le n$).

The Database Schema: Roughly speaking, an object database schema is
a set of class definitions. However, providing a formal definition of the object
schema is not an obvious task. This is mainly caused by the interference between
the inheritance hierarchy and the overloading facilities. Let us have two enumerable
non-empty disjoint sets ATT (attribute names) and METH (method names) having no
common elements with BASE and CLASS.

An object database schema (or schema for short) $S$ then consists of:

- a finite non-empty subset $C$ of CLASS,

- a binary acyclic relation isa over $C$, and

- three partial functions $att : C \times \mathrm{ATT} \to T$, $con : C \times T^* \to \{void\}$ and
$meth : C \times \mathrm{METH} \times T^* \to T^u$

such that:

- for every $c \in C$ at least one of the three sets, $isa(c) = \{c' \mid c\ isa\ c'\}$,
$att(c) = \{(a, t) \mid att(c, a) = t\}$ and $meth(c) = \{(m, r, t) \mid meth(c, m, r) = t\}$,
is not empty; and

- for every $c \in C$, if the set $con(c) = \{r \mid con(c, r) = void\}$ is not empty, then
$att(c)$ is not empty, too; and

- for any type $t$ occurring in the domain or in the range of the functions att, con
or meth and containing a class name $c$, $c \in C$.
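The conditions of the schema definition lend themselves to a mechanical check. The following sketch assumes a hypothetical encoding that is not part of the paper: types are nested tuples such as `("base", "int")`, `("class", "Person")`, `("set", t)`, `("rec", ((p, t), ...))` and `("void",)`; isa is a set of pairs; att and meth are dictionaries keyed by `(c, a)` and `(c, m, rank)`; con is a set of `(c, rank)` pairs. The example schema is invented.

```python
def class_names(t):
    """Class names contained in a type expression."""
    kind = t[0]
    if kind == "class":
        return {t[1]}
    if kind == "set":
        return class_names(t[1])
    if kind == "rec":
        return set().union(*(class_names(ft) for _, ft in t[1]))
    return set()                      # basic types and void


def is_acyclic(isa):
    succ = {}
    for c, d in isa:
        succ.setdefault(c, set()).add(d)
    state = {}                        # 1 = being visited, 2 = finished

    def visit(c):
        if state.get(c) == 1:
            return False              # back edge: a cycle
        if state.get(c) == 2:
            return True
        state[c] = 1
        ok = all(visit(d) for d in succ.get(c, ()))
        state[c] = 2
        return ok

    return all(visit(c) for c in list(succ))


def check_schema(classes, isa, att, con, meth):
    """Return a list of violated conditions (empty if the schema is valid)."""
    problems = []
    if not classes:
        problems.append("C must be non-empty")
    if not is_acyclic(isa):
        problems.append("isa must be acyclic")
    for c in sorted(classes):
        has_att = any(key[0] == c for key in att)
        has_meth = any(key[0] == c for key in meth)
        has_super = any(sub == c for sub, _ in isa)
        if not (has_att or has_meth or has_super):
            problems.append(f"{c}: isa(c), att(c) and meth(c) are all empty")
        if any(key[0] == c for key in con) and not has_att:
            problems.append(f"{c}: constructor declared but att(c) is empty")
    used = list(att.values())
    for _, rank in con:
        used.extend(rank)
    for (_, _, rank), result in meth.items():
        used.extend(rank)
        used.append(result)
    for t in used:
        for name in sorted(class_names(t) - set(classes)):
            problems.append(f"unknown class name {name}")
    return problems


classes = {"Person", "Employee"}
isa = {("Employee", "Person")}
att = {("Person", "name"): ("base", "string"),
       ("Employee", "boss"): ("class", "Employee")}
con = {("Person", (("base", "string"),))}
meth = {("Employee", "raise_salary", (("base", "int"),)): ("void",)}
problems = check_schema(classes, isa, att, con, meth)
```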

The relation isa defines a hierarchy over $C$. We say $c$ inherits $c'$ whenever $c\ isa\ c'$.
For each class name $c$ in $C$, the tuple $(c, isa(c), att(c), con(c), meth(c))$ is called
a class definition in $S$ with name $c$. A class definition thus has a unique name.
A pair $(a, t)$ in $att(c)$ is called the declaration of an attribute of $c$ with name
$a$ and type $t$. Similarly, an $r$ in $con(c)$ is called the declaration of a constructor
of $c$ with rank $r$, and a triple $(m, r, t)$ in $meth(c)$ is called the declaration of a
method of $c$ with name $m$, rank $r$, and type $t$. The pair $(r, t)$ is the profile of
the method. Informally, constructors serve for object initialization and methods
represent parameterized computations or updates over objects. The result type
of the latter is void.

Some remarks explaining the above definition are in order at this point. 

1) The functionality of att and meth indicates that attribute overloading is not
allowed within a class, while method overloading within a class is allowed provided
that the overloaded methods do not have the same rank. Therefore, an attribute
$(a, t)$ and a method $(m, r, t)$ in a class $c$ should be regarded as $(c, a, t)$ and
$(c, m, r, t)$, respectively, in the whole schema. However, we identify an attribute
by its name where no confusion is possible.

2) The first condition of the definition requires that a class has at least one
attribute or one method; otherwise it inherits at least one class.

3) The second condition requires that a class has at least one attribute if it has 
a constructor. 

4) The last condition of the definition requires that an attribute or a method 
does not refer to a class name outside of the schema. 

The Subtyping Relation: Each schema generates in a natural way a partial
order over types. Indeed, the acyclicity of isa implies that its transitive and
reflexive closure is a partial order over $C$. This partial order is called the
inheritance relation and is denoted by $\preceq_{isa}$. We read $c \preceq_{isa} c'$ as
$c$ is a subclass of $c'$ or $c'$ is a superclass of $c$. The partial order $\preceq_{isa}$
is extended to a partial order $\preceq_T$ over $T$ in the following way:

- if $b$ is a basic type then $b \preceq_T b$, and if $c \preceq_{isa} c'$ then $c \preceq_T c'$;

- if $t \preceq_T t'$ then $(set\ t) \preceq_T (set\ t')$;

- if $t_1 \preceq_T t'_1, \ldots, t_k \preceq_T t'_k$ and $k \le n$, then
$rec\ p_1 : t_1, \ldots, p_n : t_n\ end \preceq_T rec\ p_1 : t'_1, \ldots, p_k : t'_k\ end$.

We read $t \preceq_T t'$ as $t$ is a subtype of $t'$ or $t'$ is a supertype of $t$. The partial order
$\preceq_T$ is extended to a partial order $\preceq_{T^*}$ on $T^*$ and to a partial order
$\preceq_{T^u}$ on $T^u$ by setting $void \preceq_{T^u} void$ and
$t_1 \cdots t_n \preceq_{T^*} t'_1 \cdots t'_k$ if either both sequences are
empty or $k = n$ and $t_i \preceq_T t'_i$ for all $i$, $1 \le i \le n$. In the sequel, we refer to any of
these ordering relations as a subtype relation, and we omit the subscripts where
they are understood from the context.
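The subtype relation can be sketched as a recursive check. The nested-tuple encoding of types (`("base", b)`, `("class", c)`, `("set", t)`, `("rec", ((p, t), ...))`, `("void",)`) and the example hierarchy are assumptions made for this illustration; isa is given as a set of pairs and is assumed to be acyclic, so the recursion terminates.

```python
def is_subclass(c, d, isa):
    """Reflexive and transitive closure of isa."""
    return c == d or any(x == c and is_subclass(y, d, isa) for x, y in isa)


def is_subtype(t, u, isa):
    if t[0] == "void" or u[0] == "void":
        return t == u                             # void is only below void
    if t[0] == "base" and u[0] == "base":
        return t[1] == u[1]
    if t[0] == "class" and u[0] == "class":
        return is_subclass(t[1], u[1], isa)
    if t[0] == "set" and u[0] == "set":
        return is_subtype(t[1], u[1], isa)
    if t[0] == "rec" and u[0] == "rec":
        fields_t, fields_u = t[1], u[1]
        # The subtype may have extra trailing fields (k <= n).
        return len(fields_u) <= len(fields_t) and all(
            p == q and is_subtype(pt, qt, isa)
            for (p, pt), (q, qt) in zip(fields_t, fields_u))
    return False


def is_subrank(r, s, isa):
    """Pointwise extension to sequences of types of equal length."""
    return len(r) == len(s) and all(is_subtype(x, y, isa) for x, y in zip(r, s))


isa = {("Employee", "Person")}
employee, person = ("class", "Employee"), ("class", "Person")
wide = ("rec", (("who", employee), ("age", ("base", "int"))))
narrow = ("rec", (("who", person),))
```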

Hierarchical Consistency (Overloading and Overriding Problems):
Roughly speaking, overriding consists of hiding an attribute or a method of a
class in a subclass by redefining it. More precisely,

we say that an attribute $(a, t')$ of a class $c'$ is overridden in a class $c$
if $c \preceq_{isa} c'$ and there is an attribute $(a, t)$ in $c$. Similarly, a method
$m : r' \to t'$ of a class $c'$ is said to be overridden in a class $c$ if $c \preceq_{isa} c'$
and there is a method $m : r \to t$ in $c$ such that $r' \preceq_{T^*} r$.

This overriding concept is similar to, but more general than, that of the programming
languages Java [20] and C++ [23]. In these languages overriding is strong in the
sense that the two methods, in the class and in the subclass, must have the
same profile. We allow the domain of an overriding method to be a supertype of
the domain of the overridden method. This phenomenon is known as
contravariance of parameter types (see [1] for details). The problem of overriding
is a delicate one in the object-oriented paradigm, since overridden methods
can cause inconsistency during binding, and our definition of schema can give
way to such inconsistency. To avoid it, some conditions must be satisfied.

Definition 1 (covariance) The database schema $S$ is said to satisfy the
covariance condition iff

1. for every attribute $(a, t')$ in a class $c'$, if it is overridden in a class $c$ as $(a, t)$
then $t = t'$, and

2. for every method $(m, r', t')$ in a class $c'$, if it is overridden in a class $c$ as
$(m, r, t)$ then $t \preceq t'$. ■

Thus, an attribute type cannot be changed in a subclass, but the type of a 
method can be changed to its subtype. Attribute overriding as it is defined is 
just a possibility to define the same attribute in a subclass, which can be useful 
for avoiding clashes in case of multiple inheritance. Covariance can be checked 
automatically. 
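A possible automatic check is sketched below in Python. It is a simplification made for illustration: types are restricted to basic types and class names (no set or rec types), and the dictionaries `isa`, `att`, `meth` and the function names are hypothetical encodings, not the authors' notation.

```python
def superclasses(c, isa):
    """All proper superclasses of c; isa maps a class to its direct superclasses."""
    seen, stack = set(), list(isa.get(c, ()))
    while stack:
        d = stack.pop()
        if d not in seen:
            seen.add(d)
            stack.extend(isa.get(d, ()))
    return seen

def subtype(t, u, isa):
    # Simplification: only basic types and class names are handled.
    return t == u or u in superclasses(t, isa)

def covariance_violations(isa, att, meth):
    """att[c] = {a: t}; meth[c] = {(m, r): t}, r being a tuple of parameter types.

    Returns a list of (kind, subclass, superclass, name) for every violation."""
    bad = []
    for c in isa:
        for d in superclasses(c, isa):
            # condition 1: an overridden attribute keeps its type
            for a, t in att.get(c, {}).items():
                t2 = att.get(d, {}).get(a)
                if t2 is not None and t != t2:
                    bad.append(("attribute", c, d, a))
            # condition 2: the result type of an overriding method is a subtype
            for (m, r), t in meth.get(c, {}).items():
                for (m2, r2), t2 in meth.get(d, {}).items():
                    overrides = (m == m2 and len(r) == len(r2) and
                                 all(subtype(x2, x, isa) for x, x2 in zip(r, r2)))
                    if overrides and not subtype(t, t2, isa):
                        bad.append(("method", c, d, m))
    return bad
```

An empty result means that the schema, as encoded, satisfies both conditions of Definition 1.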

Schema Closure: It can be easily seen that the model supports multiple
inheritance. Indeed, apart from acyclicity, there is no other restriction on isa. Thus, an
attribute name (a method name) can occur in several superclasses of a class $c$
with the same or different type (profile). The problem is which of these attributes
(methods) must be inherited in $c$. The answer is the attribute (method) that is
in the lowest superclass of $c$, provided that such a lowest superclass exists.
Therefore, we need another condition: the minimal condition. Formally, we define
$super^*(c, a) = \{c' \mid c \prec_{isa} c' \wedge (\exists t \in T.\ att(c', a) = t)\}$, and
$super^*(c, m, r) = \{c' \mid c \prec_{isa} c' \wedge (\exists t \in T^u.\ meth(c', m, r) = t)\}$.

An element of $super^*(c, a)$ is either $c$ or a superclass of $c$ in which $a$ is the name
of an attribute. An element of $super^*(c, m, r)$ is either $c$ or a superclass of $c$ in
which $m$ is the name of a method with parameter types $r$.

Definition 2 (minimal condition) We say that a schema $S$ satisfies the
minimal condition if the following holds:

— For every class name $c$, attribute name $a$, method name $m$ and sequence of
types $r$ occurring in $S$, each of the two sets $super^*(c, a)$ and $super^*(c, m, r)$ has
a minimum whenever it is not empty.

These minimums are denoted by $ResA(c, a)$ and $ResM(c, m, r)$ and called the
resolution of $a$ in $c$ and the resolution of $m$ in $c$ with respect to $r$, respectively. ■

The minimal condition can be checked automatically.
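As an illustration of such a check, the following Python sketch computes the resolution of an attribute name and reports a violation of the minimal condition. The encoding (`isa` as a dictionary of direct superclasses, `att` as a dictionary of explicitly defined attributes) and the function names are assumptions made for this example only.

```python
def ancestors_or_self(c, isa):
    """c together with all its direct and indirect superclasses."""
    seen, stack = {c}, [c]
    while stack:
        for d in isa.get(stack.pop(), ()):
            if d not in seen:
                seen.add(d)
                stack.append(d)
    return seen

def res_a(c, a, isa, att):
    """Return the minimum of super*(c, a), or None if that set is empty.

    Raises ValueError if the set is non-empty but has no minimum,
    i.e. if the minimal condition is violated for (c, a)."""
    candidates = {d for d in ancestors_or_self(c, isa) if a in att.get(d, {})}
    if not candidates:
        return None
    for d in candidates:
        # d is the minimum if every candidate is d itself or a superclass of d
        if candidates <= ancestors_or_self(d, isa):
            return d
    raise ValueError(f"minimal condition violated for ({c}, {a})")
```

In a diamond hierarchy where `D` inherits from `B` and `C`, and both inherit from `A`, the attribute resolves to `A` if only `A` defines it, and the check fails if `B` and `C` both define it while `D` does not. The resolution of methods could be handled the same way with the pair (method name, parameter types) as the key.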

If $ResA(c, a) = c'$, then there is an attribute $(a, t)$ in $c'$ and $c \prec_{isa} c'$. If $c' = c$,
then $(a, t)$ is explicitly defined in $c$, that is $att(c, a) = t$. Otherwise, $att(c, a)$ is not
defined explicitly, but $(a, t)$ may be treated as an implicit (inherited) attribute
of $c$. Thus, we can extend $att$ to $(c, a)$ by defining $att(c, a) = t$. In a similar way,
if $ResM(c, m, r) = c'$, then there is a method $m : r \to t'$ in $c'$. Then, either
$c = c'$ and $m : r \to t'$ is an explicit method in $c$, that is $meth(c, m, r) = t'$, or
$m : r \to t'$ is an inherited method in $c$ and we can define $meth(c, m, r) = t'$.
Therefore, when the schema satisfies the minimal condition, the following rules
extend the functions $att$ and $meth$ to all inherited attributes and methods:

$$\frac{ResA(c, a) = c' \quad att(c', a) = t}{att(c, a) = t} \qquad \frac{ResM(c, m, r) = c' \quad meth(c', m, r) = t'}{meth(c, m, r) = t'} \qquad (2)$$

It is not difficult to see that $\bar{S} = (C, isa, \overline{att}, con, \overline{meth})$ is a database schema,
which we call the closure of $S$ under inheritance. Indeed, the schema $\bar{S}$ is
obtained from $S$ by adding all inherited attributes and inherited methods, using
rules (2). According to these rules, $\overline{att}$ and $\overline{meth}$ are partial functions and no new
class name is added to $S$. Moreover, no new parameter type can be created,
therefore $\bar{\bar{S}} = \bar{S}$. Note that constructors are not inherited.

The above discussion shows that a well-defined schema should satisfy covariance 
and minimality conditions. Such a schema is called hierarchically consistent. 
Hierarchical consistency guarantees that inheriting and binding will not cause 
any ambiguities. 

Theorem 1 ([16]) If $S$ is a hierarchically consistent schema, then so is $\bar{S}$.

We consider only a hierarchically consistent schema in the sequel. 

3 Algebra and Database 

Basic Algebra: A basic algebra $B$ associates with each type $t$ a set $B_t$, its carrier,
and with each operation of $t$ a partial function, its implementation. The carriers
of our types are defined recursively as follows:

— The carrier of each basic type $t$ is an enumerable set $B_t$. The carriers of basic
types are assumed to be pairwise disjoint.

— The carrier of each $c \in C$ is the set $Oid$, which is supposed to be a special
set disjoint from the carriers of all basic types.

— The carrier of $\mathbf{set}\ t$ is $\mathcal{P}_{fin}(B_t)$, the set of finite subsets of $B_t$.

— The carrier of $\mathbf{rec}\ p_1 : t_1, \ldots, p_n : t_n\ \mathbf{end}$ is the set of tuples $(v_1, \ldots, v_n)$ such
that at least one $v_i$ is in $B_{t_i}$ and any other $v_j$ can be $\bot$, where $\bot$ is a special
value that is neither an element of a basic type nor an element of $Oid$.

Values of basic types are called observable values, those of $Oid$ non-observable
values or object identities. The special value $\bot$ is called the null value, also
denoted by NULL in the sequel.

We assume that each basic type is endowed with some operations, and each
of them is implemented as a partial function $op^B : B_{t_1} \times \cdots \times B_{t_n} \to B_t$ when
$n > 0$, otherwise $op^B \in B_t$. Each set type is assumed to be endowed with the usual
set operations. A record type $r = \mathbf{rec}\ p_1 : t_1, \ldots, p_n : t_n\ \mathbf{end}$ possesses a number
of partial projection operations $p_i : r \to t_i$ and a record construction operation
$\mathbf{rec} : t_1, \ldots, t_n \to r$
mapped in $B$ to a partial function
$\mathbf{rec}^B : B^{\bot}_{t_1} \times \cdots \times B^{\bot}_{t_n} \to B_r$,

where $B^{\bot}_{t_i} = B_{t_i} \cup \{\bot\}$, so that $p_i^B(\mathbf{rec}^B(v_1, \ldots, v_n)) = v_i$ if $v_i$ is not $\bot$ and it is
undefined otherwise. Note that the result of a projection function is undefined
rather than NULL if a record is projected to a field with value NULL. This helps
us to avoid typing problems which are inevitable if NULL is used in expressions.

The equality operation for two records, $r_1$ and $r_2$, of the same type $r =
\mathbf{rec}\ p_1 : t_1, \ldots, p_n : t_n\ \mathbf{end}$ is defined as follows: $r_1 = r_2$ iff, for each $i = 1, \ldots, n$,
either both $p_i(r_1)$ and $p_i(r_2)$ are defined and $p_i(r_1) = p_i(r_2)$ or both $p_i(r_1)$ and
$p_i(r_2)$ are not defined.
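The behaviour of record construction, partial projection and record equality can be sketched in Python as follows. The names `NULL`, `Undefined`, `rec`, `project` and `rec_equal` are hypothetical, and a partial function that is undefined at a point is modelled by raising an exception; both choices are assumptions of this example rather than part of the formal model.

```python
NULL = object()  # stands in for the special null value of the text

class Undefined(Exception):
    """Raised when a partial function is applied outside its domain."""

def rec(fields, values):
    """Record construction: at least one component must be non-null."""
    if len(fields) != len(values) or all(v is NULL for v in values):
        raise Undefined("a record needs at least one non-null component")
    return dict(zip(fields, values))

def project(r, p):
    """Partial projection: undefined (not NULL) on a field holding NULL."""
    if r[p] is NULL:
        raise Undefined(f"projection {p} is undefined")
    return r[p]

def rec_equal(r1, r2):
    """Equality of two records of the same type: every projection is either
    defined on both sides with equal results, or undefined on both sides."""
    def defined(r, p):
        return r[p] is not NULL
    return all(
        (defined(r1, p) and defined(r2, p) and r1[p] == r2[p])
        or (not defined(r1, p) and not defined(r2, p))
        for p in r1
    )
```

Two records that both hold NULL in the same field compare as equal on that field, although neither projection yields a value.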

The only predefined operation of $c \in C$ is the comparison operation $=$,
such that $o = o'$ is true iff both $o$ and $o'$ are the same element of $Oid$.

State Algebra: A state algebra represents a database state. Let $B$ be a basic
algebra, $S = (C, isa, att, con, meth)$ a database schema, and $\bar{S}$ its closure under
inheritance.

A state algebra $A$ over $S$ is created in the following way:

1. A finite partial identity function, $id^A_{cc} : Oid \to Oid$, is associated with each
$c \in C$ so that $id^A_{cc}(o)$ and $id^A_{c'c'}(o)$ are both defined iff there is $c''$ such that
$c'' \prec_{isa} c$ and $c'' \prec_{isa} c'$ and $id^A_{c''c''}(o)$ is defined.

We denote by $A_c$ the range of the function $id^A_{cc}$. For a basic type $t$, we define
$A_t = B_t$ and we then extend $A_t$ to any type $t$ created with the use of the
type constructors.

2. A partial function $a^A_{ct} : A_c \to A_t$ is associated with each attribute $(a, t) \in
att(c)$ so that, for each pair $(c, c')$, if $c \prec_{isa} c'$ and $o \in A_c$, then
$a^A_{ct}(o) = a^A_{c't}(o)$. Such a function is called an attribute function.

Thus, each subclass inherits part of its superclass attribute functions. In the
sequel, the set $A_c$ is called the extent of $c$ in the state $A$. Each $o$ in $A_c$ is called an
object identity of the class $c$. Note that the set of object identities of a superclass
includes object identities of its subclasses. Indeed, if $c$ is a subclass of $c'$ then
in clause 1 above $c''$ is $c$. Thus, if $id^A_{cc}(o)$ is defined so is $id^A_{c'c'}(o)$, in other
words $A_c \subseteq A_{c'}$. Therefore, the semantics of inheritance in a state algebra is set
inclusion.

If $o \in A_c$, then $a^A_{ct}(o)$ is an attribute of $o$. An object is a pair $(o, obs)$ where
$o$ is an object identity and $obs$ is the tuple of its attributes, called the object's state.
We sometimes write "object $o$" meaning the object with the identity $o$. If $id^A_{cc}(o)$
is defined, we say that $c$ is a type of $o$. Furthermore, if there is no $c' \prec_{isa} c$ such
that $id^A_{c'c'}(o)$ is defined, we say that $c$ is the most specific type of $o$. An object $o$
is a proper object of a class $c$ iff it is not in any subclass of $c$. In practice, some
classes are generic and may not have proper objects (because either no attribute
is defined or inherited in the class or some of its methods are not implemented).
Such a class is called an interface in [4]. For simplicity, we do not distinguish
between classes and interfaces in this paper.




Several state algebras over a database schema $S$ can have the same basic
algebra. Following the notation of [10], we denote the set of all state algebras
with the same basic algebra $B$ by $state_B(S)$, and by an $S_B$-state we mean a state
over $S$ with the basic algebra $B$.

We say that a value $v$ of type $t$ contains a value (object identity) $o$ of class
$c$ in the state $A$ if one of the following holds:

— $t$ is $c$ and $v$ is $o$,

— $t$ is $\mathbf{rec}\ p_1 : t_1, \ldots, p_n : t_n\ \mathbf{end}$ and at least one of $p_i(v)$ contains $o$,

— $t$ is $\mathbf{set}\ t_1$ and at least one $w \in v$ contains $o$.
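The recursive definition above translates directly into a small Python function. The encoding is an assumption of this sketch: a basic type or class is a string, `("set", t)` and `("rec", ((p1, t1), ...))` are the constructed types, a record value is a dictionary, and a null field is represented by `None`.

```python
def contains(v, t, o, c):
    """True if value v of type t contains object identity o of class c."""
    if isinstance(t, str):
        # first case: t is c and v is o
        return t == c and v == o
    kind = t[0]
    if kind == "rec":
        # second case: some defined projection of the record contains o
        return any(
            v[p] is not None and contains(v[p], ft, o, c)
            for p, ft in t[1]
        )
    if kind == "set":
        # third case: some element of the set contains o
        return any(contains(w, t[1], o, c) for w in v)
    return False
```

For example, a record with an `owner` field of class `Person` and a `friends` field of type `set Person` contains every identity stored in either field.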

To define the interpretation of method names, we first need to introduce a
formal notion of state update.

State Update: One state can be transformed into another by a state update.

Definition 3 A state update in an $S_B$-state $A$ is a triple $(f_{ct}, o, v)$ where $f_{ct}$ is
either an attribute symbol of type $t$ in class $c$ or the identity function symbol
$id_{cc}$, and the other two elements are the following:

— for an update $(id_{cc}, o, o)$, the object identity $o$ is not in $A_{c'}$, for any $c' \in C$,

— $o \in A_c$ and $v$ is either an element of $A_t$ or the symbol $\bot$ in all other cases. ■

Note that $f_{ct}$ in a triple $(f_{ct}, o, v)$ is actually a function symbol, i.e., a function
name qualified with its profile, which helps us to avoid ambiguity when attribute
names are overloaded in different classes.

A state update $\alpha = (f_{ct}, o, v)$ serves for the transformation of an $S_B$-state $A$
into a new algebra $A\alpha$ in the following way:

— $g^{A\alpha}_{c't'}$ is the same as $g^A_{c't'}$, for any $g_{c't'}$ different from $f_{ct}$;

— $f^{A\alpha}_{ct}(o) = v$ if $v$ is not $\bot$, $f^{A\alpha}_{ct}(o)$ becomes undefined otherwise;

— $f^{A\alpha}_{ct}(o') = f^A_{ct}(o')$ for any $o'$ different from $o$.

Following Gurevich [12], we say that $A\alpha$ is obtained by firing the update $\alpha$ on
the state $A$. Roughly speaking, firing a state update either inserts an element
into the definition domain of an attribute function $a_{ct}$, or modifies the value of
such a function at one point in its definition domain, or removes an element from
the definition domain. The state update $(id_{cc}, o, o)$ in fact extends the set of
object identities of a class $c$ by a new element $o$, and the state update $(id_{cc}, o, \bot)$
contracts the set of object identities of a class $c$.
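Firing a single update can be sketched in Python as below. The representation is an assumption made for this example: a state is a dictionary from function symbols (any hashable key, such as `("id", "Person")` or `("name", "Person", "string")`) to dictionaries from object identities to values, and `None` plays the role of the symbol that makes a function undefined at a point.

```python
import copy

BOTTOM = None  # plays the role of the "undefined" symbol in an update triple

def fire(state, update):
    """Return the state obtained by firing the update (f, o, v) on `state`.

    The input state is left untouched; all functions other than f, and f at
    all points other than o, are the same in the result."""
    f, o, v = update
    new_state = copy.deepcopy(state)
    table = new_state.setdefault(f, {})
    if v is BOTTOM:
        table.pop(o, None)   # f becomes undefined at o
    else:
        table[o] = v         # insert o into the domain of f, or overwrite f(o)
    return new_state
```

Firing `(("id", "Person"), "o1", "o1")` adds `o1` to the extent of `Person`, and firing `(("id", "Person"), "o1", None)` removes it again.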

Definition 4 A set $\Gamma$ of state updates is inconsistent if it contains

— two state updates $\alpha_1 = (f_{ct}, o, v)$ and $\alpha_2 = (f_{ct}, o, v')$ s.t. $v \ne v'$ (two state
updates defining an attribute function differently at the same point), or

— an $\alpha_1 = (a_{ct}, o, v)$ and $\alpha_2 = (id_{c'c'}, o', \bot)$ such that either $c'$ is $c$ and $o = o'$
or $t$ contains $c'$ and $v$ contains $o'$ (an object is removed from a class extent
while an attribute function is forced to use it);

the update set is consistent otherwise.
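A possible check of Definition 4 is sketched below in Python. Several things are assumptions of this sketch and not statements of the paper: function symbols are encoded as `("id", c)` or `("att", a, c, t)`, only basic, class and set types are handled, and an attribute update whose value is the undefined symbol is not counted as "using" the object, so that it does not clash with a deletion of that object.

```python
BOTTOM = None  # plays the role of the "undefined" symbol

def contains(v, t, o, c):
    """True if value v of type t contains object identity o of class c."""
    if v is BOTTOM:
        return False
    if isinstance(t, str):                      # basic type or class name
        return t == c and v == o
    if t[0] == "set":
        return any(contains(w, t[1], o, c) for w in v)
    return False

def conflict(a1, a2):
    """True if the pair {a1, a2} is inconsistent."""
    (f1, o1, v1), (f2, o2, v2) = a1, a2
    # first clause: same function, same point, different values
    if f1 == f2 and o1 == o2 and v1 != v2:
        return True
    # second clause: an attribute value uses an object that is being deleted
    for (f, o, v), (g, p, w) in ((a1, a2), (a2, a1)):
        if f[0] == "att" and v is not BOTTOM and g[0] == "id" and w is BOTTOM:
            _, _, c, t = f
            if (g[1] == c and o == p) or contains(v, t, p, g[1]):
                return True
    return False

def inconsistent(updates):
    """True if some pair of updates in the collection is in conflict."""
    ups = list(updates)
    return any(conflict(a, b) for i, a in enumerate(ups) for b in ups[i + 1:])
```

The check is pairwise, which matches the definition: an update set is inconsistent exactly when it contains two updates of one of the two listed forms.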




A consistent update set $\Gamma$ applied to an $S_B$-state $A$ transforms it into an algebra
$A'$ by simultaneously firing all $\alpha \in \Gamma$. If $\Gamma$ is inconsistent, $A'$ is not defined. If $\Gamma$ is
empty, $A'$ is the same as $A$. Following [18], we denote the application of $\Gamma$ to a
state $A$ by $A\Gamma$.

Definition 5 The sequential union of two consistent update sets $\Gamma_1$ and $\Gamma_2$,
denoted by $\Gamma_1;\Gamma_2$, is a consistent update set created as follows:

— delete from $\Gamma_1 \cup \Gamma_2$ any $\alpha_1 \in \Gamma_1$ for which there is an $\alpha_2 \in \Gamma_2$ such
that $\{\alpha_1, \alpha_2\}$ is inconsistent, and any $\alpha_2 = (id_{cc}, o, \bot) \in \Gamma_2$ if $\alpha_1 =
(id_{cc}, o, o) \in \Gamma_1$. ■
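The construction of Definition 5 can be sketched in Python with update sets represented as sets of triples. The pairwise inconsistency test is passed in as a parameter; the default used in the usage example, `same_point_conflict`, covers only the first clause of Definition 4 and is a deliberate simplification. All names and the symbol encoding `("id", c)` are hypothetical.

```python
def seq_union(g1, g2, inconsistent_pair):
    """Sequential union of two consistent update sets g1 and g2.

    Updates are triples (f, o, v); None stands for the undefined symbol and
    identity symbols are encoded as ("id", c)."""
    # deletions in g2 that cancel a creation of the same object in g1
    cancelled = {
        (f, o, None)
        for (f, o, v) in g1
        if f[0] == "id" and v == o and (f, o, None) in g2
    }
    # updates of g1 that are not contradicted by any update of g2
    kept1 = {a1 for a1 in g1 if not any(inconsistent_pair(a1, a2) for a2 in g2)}
    return kept1 | (g2 - cancelled)

def same_point_conflict(a1, a2):
    # first clause of Definition 4 only: same function and object, other value
    return a1[0] == a2[0] and a1[1] == a2[1] and a1[2] != a2[2]
```

In effect, later updates win over earlier ones at the same point, and an object created in the first set and deleted in the second leaves no identity update behind.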

The above definition of a consistent update set permits us to check whether 
there are internal contradictions in an update set. However, a consistent update 
set can contradict the state to which it should be applied. Therefore, a notion 
of an applicable update set is needed. 

Definition 6 A consistent update set $\Gamma$ is applicable to a state $A$ if either $\Gamma$ does
not contain an update of the form $(id_{cc}, o, \bot)$, or for each $(id_{cc}, o, \bot) \in \Gamma$:

— either there is no $a^A_{c't}$ such that, given an $o' \in A_{c'}$, the attribute $a^A_{c't}(o')$
contains $o$, or $(a_{c't}, o', \bot) \in \Gamma$ (either the object to be deleted is not referenced
by any other object or all such references are deleted in this update set);

— either $a^A_{ct}(o)$ is not defined for any $(a, t) \in att(c)$ or $(a_{ct}, o, \bot) \in \Gamma$ (either all
attributes of this object are not defined in $A$ or they are made undefined in
this update set). ■

According to the above definition, we do not allow an automatic cascaded
deletion of objects. If such a deletion is needed, it should be programmed in a
method. Unfortunately, it is not sufficient to have an applicable update set to
produce an $S_B$-state. The problem is caused by inheritance. For example, firing
an $\alpha = (id_{cc}, o, o)$ in an $S_B$-state $A$ updates $id^A_{cc}$ but does not change $id^A_{c'c'}$ for
any $c'$ with $c \prec_{isa} c'$. Thus, to guarantee that a state transformation produces a state
algebra, we define a closure of an update set.

Definition 7 For a consistent update set $\Gamma$, the closure $\bar{\Gamma}$ of $\Gamma$ is constructed as
follows:

— for each $(id_{cc}, o, o) \in \Gamma$, insert $(id_{c'c'}, o, o)$ in $\bar{\Gamma}$ for all $c'$ such that $c \prec_{isa} c'$;

— for each $(id_{cc}, o, \bot) \in \Gamma$ having $c''$ as the most specific type of $o$, insert
$(id_{c'c'}, o, \bot)$ in $\bar{\Gamma}$ for all $c'$ such that $c'' \prec_{isa} c'$;

— for each $(a_{ct}, o, v) \in \Gamma$ having $c''$ as the most specific type of $o$, insert
$(a_{c't}, o, v)$ in $\bar{\Gamma}$ for all $c'$ such that $c'' \prec_{isa} c'$ and $(a, t) \in att(c')$. ■

The first clause of the above definition guarantees that when an object is inserted
into a class it is also inserted into all superclasses of this class. The second clause
guarantees that when an object is deleted from a class it is also deleted from all
subclasses in which it exists. The third clause guarantees that an update of an attribute
function for an object of a certain class will affect the corresponding attribute
function of all classes that are supertypes of the most specific type of the object.
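The three clauses can be sketched in Python as follows. The encoding is hypothetical: identity symbols are `("id", c)`, attribute symbols are `("att", a, c, t)`, `None` stands for the undefined symbol, `att[c]` lists the attributes available in class `c`, and `most_specific` maps each existing object identity to its most specific class. For an object that is not yet in `most_specific`, the sketch falls back to the class named in the update, which is an assumption of this example.

```python
def ancestors_or_self(c, isa):
    """c together with all its direct and indirect superclasses."""
    seen, stack = {c}, [c]
    while stack:
        for d in isa.get(stack.pop(), ()):
            if d not in seen:
                seen.add(d)
                stack.append(d)
    return seen

def close(updates, isa, att, most_specific):
    """Closure of a consistent update set under the class hierarchy."""
    closed = set(updates)
    for f, o, v in updates:
        if f[0] == "id" and v is not None:
            # clause 1: an inserted object enters every superclass extent
            for d in ancestors_or_self(f[1], isa):
                closed.add((("id", d), o, o))
        elif f[0] == "id":
            # clause 2: a deleted object leaves every class that holds it
            for d in ancestors_or_self(most_specific.get(o, f[1]), isa):
                closed.add((("id", d), o, None))
        else:
            # clause 3: propagate the attribute update along the hierarchy
            _, a, c, t = f
            for d in ancestors_or_self(most_specific.get(o, c), isa):
                if att.get(d, {}).get(a) == t:
                    closed.add((("att", a, d, t), o, v))
    return closed
```

Applying `close` to its own result adds nothing new, which is the property used in the text to call an update set closed.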



Fact 1 The closure $\bar{\Gamma}$ of a consistent update set $\Gamma$ is well defined and is
consistent. Moreover, if $\Gamma$ is applicable to a state $A$ so is $\bar{\Gamma}$. ■

We say a consistent update set $\Gamma$ is closed if $\bar{\Gamma} = \Gamma$.

Fact 2 1) For any state algebra $A$ and any closed applicable update set $\Gamma$, $A\Gamma$ is
a state algebra.

2) If $\Gamma_1$ and $\Gamma_2$ are closed applicable update sets, so is $\Gamma_1;\Gamma_2$.

3) For any state $A$ and applicable update sets $\Gamma_1$ and $\Gamma_2$, $A(\Gamma_1;\Gamma_2) = (A\Gamma_1)\Gamma_2$. ■

The set of all closed applicable update sets in an $S_B$-state $A$ is denoted by
$update_A(S)$. It serves for the definition of the semantics of the type void in
the sequel, i.e., $A_{void} = update_A(S)$. We consider only closed consistent update
sets from now on. If there is no possibility of confusion, we simply write an
update set in the sequel.

Database: In practice, we are interested only in a part of $state_B(S)$ called valid
states. A valid state is normally a state satisfying a set of constraints (or axioms)
specified within the schema. In an object-oriented database the only admissible 
updates are those produced by an updating method. Specifying the constraints 
and checking the validity of a state is an important and sophisticated problem 
which is out of the scope of this paper. Formally, it is sufficient to assume that 
the application of any admissible update to a valid state produces a valid state. 
Thus, without loss of generality we can suppose that all states are valid. 

In the definitions that follow, for any state algebra $A$ and any sequence of
types $r = t_1 \cdots t_n$, we denote $A_{t_1} \times \cdots \times A_{t_n}$ by $A_r$, which is a singleton set if
$n = 0$.

Definition 8 A database $DB(B)$ consists of:

1. a subset $|DB(B)|$ of $state_B(S)$ called the carrier of $DB(B)$,

2. for each $c \in C$, $r \in con(c)$ and $A \in |DB(B)|$, a partial function
$C^A_{cr} : A_c \times A_r \to A_{void}$, and

3. for each $c \in C$, $(m, r, t) \in meth(c)$ and $A \in |DB(B)|$, a partial function
$m^A_{cr} : A_c \times A_r \to A_t$ such that if $c' \prec_{isa} c$ and $m : r \to t$ is inherited in $c'$ from
$c$, then $m^A_{cr}(o, v) = m^A_{c'r}(o, v)$ for each $(o, v) \in A_{c'} \times A_r$. ■

Since constructors are not inherited, clause 2 says that constructors in a
subclass are different from constructors in its superclasses. Since an inherited
and an overridden method in a class cannot have the same profile, the functions $m^A_{cr}$
in clause 3 are safe. This clause says that if a class inherits some method from a
superclass, then both superclass objects and subclass objects are supplied with
the same method. If a subclass overrides a method of its superclass, its objects are




